Answer big ecological questions using big data skills
May 13, 2015
Interested in learning how to work with big data in R?
Sign up for A Hands-On Primer for Working with Big Data in R: Introduction to Hierarchical Data Formats, LiDAR Data & Efficient Data Visualization on Sunday, August 9, 2015, at the ESA Centennial Annual Meeting in Baltimore, MD.
Big ecological questions require big data
Asking and answering ecological questions about changes in diverse environments over large areas and long periods of time requires big data. Big data generally refer to datasets so large and complex that traditional processing applications are inadequate. Such datasets present both unique challenges and unique opportunities to the scientific research community. For example, to effectively determine the effects of development on stream water quality across a region, standardized and integrated data are needed to characterize land cover and population changes, among other things.
“Big data generally refer to massive volumes of data not readily handled by the usual data tools and practices and present unprecedented opportunities for advancing science and informing resource management through data-intensive approaches.” -Hampton et al. (2013) Big data and the future of ecology.
Working with big data requires specific skills
Working with big data in an efficient way requires a set of skills that are new to many scientists. Data formats designed to handle larger datasets, such as the hierarchical data format (HDF5):
- Provide more efficient ways to store large datasets that might contain thousands to millions of records or hypercubes of images;
- Allow users to store multi-dimensional and heterogeneous datasets needed to answer cross-cutting ecological questions; and
- Provide tools to compress and/or parse data for analysis.
Data with spatial attributes
While big data formats maximize data analysis efficiency, using them involves specific sets of skills and libraries for commonly used tools like R and Python. Other data types, like remote sensing data, which include lidar and hyperspectral imagery, are necessary for measuring changes in land cover and other attributes over broad areas and through time. Working with these data that have spatial attributes requires understanding of:
- Unique spatial and hierarchical data formats;
- Tools and libraries required to work with data - many of which are free and open source; and
- Metadata associated with the data, to ensure that analysis outcomes are scaled and located properly.
Automated and reproducible workflows
Working with big data requires automated and reproducible workflows. Crunching through thousands or even millions of data points by hand may take weeks, months or years, and repeating this type of manual analysis is difficult and time-consuming. Automated workflows that process data using coding tools like R or Python make the analysis both efficient and reproducible. In addition, many journals now require submission of both data and code prior to publication.
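As a minimal sketch of what such an automated workflow might look like in Python, the example below applies the same summary step to every CSV file in a folder. The function names, the folder layout and the column name passed in are hypothetical, chosen only for illustration; a real workflow would substitute its own data and processing steps.

```python
import csv
import statistics
from pathlib import Path


def summarize_file(csv_path, column):
    """Return the count, mean and max of one numeric column in one CSV file."""
    with open(csv_path, newline="") as handle:
        # Skip rows where the column is empty so missing values do not crash the run.
        values = [float(row[column]) for row in csv.DictReader(handle) if row[column] != ""]
    if not values:
        return {"file": Path(csv_path).name, "n": 0, "mean": None, "max": None}
    return {
        "file": Path(csv_path).name,
        "n": len(values),
        "mean": statistics.fmean(values),
        "max": max(values),
    }


def summarize_directory(data_dir, column, pattern="*.csv"):
    """Apply the same summary to every matching file, in a fixed (sorted) order."""
    return [summarize_file(path, column) for path in sorted(Path(data_dir).glob(pattern))]
```

Because every file passes through the same function, rerunning the analysis after new data arrive is a single call to `summarize_directory`, and the script itself documents exactly what was done.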
NEON Data Skills at ESA 2015
NEON scientists Leah Wasser, Natalie Robinson, Claire Lunch, Christine Laney, Kate Thibault and Sarah Elmendorf have been building, testing, delivering and improving upon a suite of data tutorials that cover big data topics including:
- Working with time series and spatial data stored in the HDF5 format in R
- Learning about the HDF5 file format using a free HDF5 viewer
- Learning key commands and libraries needed to create and work with HDF5 files in R
- Visualizing time series data stored in HDF5 format in R
- Working with LiDAR-derived raster data in R
- Working with hyperspectral imagery in R
NEON is delivering this content in collaboration with SESYNC and Data Carpentry as a full-day pre-conference workshop at the 2015 ESA Annual Meeting. Learn about the half-day workshop at ESA 2014 that prompted 2015’s full-day workshop.
NEON scientists are also hosting a free lunchtime Going 'On the Grid' Spatial Data workshop on Thursday, August 13, 2015. This workshop covers issues of uncertainty when converting vector point data to raster or gridded formats. As scientists, we often work with observation data collected at specific point locations on the ground. However, we often want to interpolate our observations continuously across larger areas, a process sometimes called “gridding”. NEON will lead a discussion and live demonstration that explains how different gridding methods can yield different results in the output rasters and, more importantly, how that might affect the results of your data analysis.
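To illustrate how the choice of gridding method can change the output, here is a small Python sketch that grids the same four point observations two ways: nearest neighbor and inverse distance weighting (IDW). This is not the workshop's own material; the point values, grid size and function names are made up for illustration, and these two methods are simply common examples of gridding approaches.

```python
import math

# Hypothetical point observations as (x, y, value); the numbers are invented.
points = [(0.5, 0.5, 10.0), (3.5, 0.5, 20.0), (0.5, 3.5, 30.0), (3.5, 3.5, 40.0)]


def cell_centers(n_cols, n_rows, cell_size=1.0):
    """Yield (row, col, x, y) for the center of every grid cell."""
    for row in range(n_rows):
        for col in range(n_cols):
            yield row, col, (col + 0.5) * cell_size, (row + 0.5) * cell_size


def grid_nearest(points, n_cols, n_rows):
    """Assign each cell the value of the closest observation."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for row, col, cx, cy in cell_centers(n_cols, n_rows):
        nearest = min(points, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
        grid[row][col] = nearest[2]
    return grid


def grid_idw(points, n_cols, n_rows, power=2.0):
    """Assign each cell a distance-weighted average of all observations."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for row, col, cx, cy in cell_centers(n_cols, n_rows):
        weighted_sum = 0.0
        weight_total = 0.0
        exact = None
        for x, y, value in points:
            distance = math.hypot(x - cx, y - cy)
            if distance == 0.0:
                # The cell center sits exactly on an observation: use it directly.
                exact = value
                break
            weight = 1.0 / distance ** power
            weighted_sum += weight * value
            weight_total += weight
        grid[row][col] = exact if exact is not None else weighted_sum / weight_total
    return grid


nearest_grid = grid_nearest(points, 4, 4)
idw_grid = grid_idw(points, 4, 4)
```

Both grids agree in the cells that contain an observation, but they differ elsewhere: nearest neighbor produces blocks of constant value with sharp edges, while IDW produces smooth gradients between the points. Any statistic computed from the raster, such as a regional mean, inherits that difference.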