Big Data Part II: Sharing the Challenges and Payoffs of Big Data
September 5, 2012
By Brian Wee, Chief of External Affairs
Big Data is not new to the science world. But to extract as much fundamental insight and predictive power from ecological Big Data as we have from large data sets in disciplines like physics, genomics, and atmospheric science, we need different and more sophisticated tools. One of the biggest challenges and opportunities that NEON continues to grapple with is finding, developing, and implementing the tools required for an ecological Big Data project of continental scale.

Ecosystems are rich with subtle and varied interactions that cannot be teased out from the noise of natural variability without a large suite of correlated measurements that capture a "snapshot" of the environment. If we want to filter out that noise and derive a new understanding of the ecosystem processes underlying our measurements, we need to capture many snapshots over time and space. Weather data typically include automated measurements of about a dozen variables, many of which can be related to each other using fundamental physical laws. NEON, on the other hand, will observe more than 500 variables at each of its 60 sites for 30 years, a breadth and depth necessary to achieve ecological insight and forecasting across ecosystems when the relationships among such variables are complex and may have yet to be discovered or described.
In addition, data collection in meteorological networks and at the Large Hadron Collider is automated, but many ecologically important samples and measurements must still be collected by hand in varying conditions. A great deal of procedural standardization and QA/QC is necessary to ensure that billions of data points collected by thousands of sensors and hundreds of people at 60 sites over 30 years are of consistently high quality.

Furthermore, each data point, whether it is a leaf nitrogen concentration, a spectral reflectance, or the taxonomic identification of a phytoplankton, must be documented with time, place, and quality-control information, as well as a link to the protocol that generated that measurement. These metadata dramatically increase the dimensionality and complexity of ecological Big Data. But they are utterly necessary; without the protocols for data collection, for instance, it is impossible for a data user to assess the accuracy and usability of the data or quantify its uncertainty in modeling applications.

Making Big Data work for ecology and environmental science requires an enormous investment of effort, resources, and ingenuity. But the potential payoff is equally large. Giving researchers large data sets and the tools required to integrate them into novel analyses opens up a world of discoveries and insights that the original data collectors may never have imagined. What's more, bigger and more integrated data sets make a new type of ecological science possible by enabling the direct quantitative testing of general theory and hypotheses. These long-term, large-scale data sets make it possible to ask and answer simple, broad questions like "what are the biological consequences of environmental change?" rather than narrow ones like "does soil nitrogen appear to be related to nighttime temperatures in the Costa Rican rainforest?" The answers to these broader questions also have broader implications for humanity.
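To make the metadata requirement concrete, the documentation attached to every measurement can be sketched as a simple record structure. This is a hypothetical illustration only; the field names and values below are invented for the example and do not reflect NEON's actual data schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

# Hypothetical sketch of a fully documented measurement record.
# Field names are illustrative, not NEON's actual schema.
@dataclass
class Measurement:
    value: float                  # e.g., leaf nitrogen concentration
    units: str
    timestamp: datetime           # when the sample was collected
    site_id: str                  # which observatory site
    latitude: float
    longitude: float
    qa_flags: list = field(default_factory=list)  # quality-control annotations
    protocol_url: str = ""        # link to the collection protocol
    uncertainty: float = 0.0      # estimated measurement uncertainty

record = Measurement(
    value=2.1, units="percent dry mass",
    timestamp=datetime(2012, 9, 5, tzinfo=timezone.utc),
    site_id="SITE-042", latitude=10.43, longitude=-84.01,
    qa_flags=["calibration-ok"],
    protocol_url="https://example.org/protocols/leaf-nitrogen",
    uncertainty=0.05,
)

# The metadata travel with the value, so a downstream user can
# check provenance and quality without contacting the collector.
print(asdict(record)["protocol_url"])
```

The point of the sketch is simply that the measured value is a small fraction of each record: most of the payload is the contextual metadata that makes the value usable decades later.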
NEON data products, for instance, can be used in models to forecast the responses of ecosystems, and ecosystems provide the human essentials of food, fiber, energy, and water. These essentials and other ecosystem services link the nation's environmental well-being to its economic success. Thus, "government has an essential role to play in the stewardship of environmental capital," as the President's Council of Advisors on Science and Technology (PCAST) asserts in its 2011 report, "Sustaining Environmental Capital: Protecting Society and the Economy."

Effective management of environmental capital requires accessible, high-quality information about the current state of the environment and its likely responses to change. To that end, the 2011 PCAST report recommends improving integration and utilization of existing data and models and filling gaps in the data (EcoINFORMA), as well as increasingly employing ecoinformatics to improve decision-making in natural resources management. The report also identifies NEON, the Long-Term Ecological Research Network (LTER), and observation programs from other Federal agencies as contributors to a body of credible data on the status and trends of the nation's ecosystems that can inform national assessments and decisions. This large and rapidly expanding body of information calls for enhanced tools to make better use of it. Thus, the U.S. government has invested in the development of technologies and infrastructure to enhance the accessibility and utility of existing and future Big Data.
The $200 million Big Data Research and Development Initiative announced earlier this year funds several such investments, including a joint National Science Foundation (NSF) / National Institutes of Health (NIH) solicitation for proposals to "advance the core scientific and technological means of managing, analyzing, visualizing, and extracting useful information from large and diverse data sets." CIF21, a complementary cyberinfrastructure initiative, aims to develop and coordinate efforts across NSF to create a distributed computing and information processing framework that transforms data acquisition, storage, management, and integration across Federal agencies. CIF21 further spawned the EarthCube Initiative to support the growth of data sharing and analysis across geoscience and biological science disciplines.

Indeed, a fertile ecosystem of cross-pollinating data initiatives is currently buzzing away at key tasks to prepare researchers for the growing demands of large-scale, interdisciplinary environmental science. For instance, the DataONE project, the EarthCube Initiative, and the Federation of Earth Science Information Partners (of which NEON is a member) are working to define better ways to publish and discover data. Another hive of activity is clustered around the task of refining semantic links between data points, links that enable the sophisticated, powerful queries needed to navigate the complex web of interrelated biological, physical, behavioral, and phylogenetic data exemplified by the pika example in the first part of this series. Still other organizations are experimenting with cloud-based, collaborative environments to help manage the workflows of large, multidisciplinary science projects such as Macrosystems Biology research. NEON is not alone in facing the challenges of ecological Big Data; it is one of many projects tackling these challenges in parallel and together.
Sandra Chung and Dave Schimel contributed to this piece.