Data Processing
Providing standardized, quality-assured data products is essential to NEON's mission of providing open data to support greater understanding of complex ecological processes at local, regional, and continental scales. At each field site, a diverse suite of biological, physical, chemical, and ecological characteristics is measured using three collection systems: automated instruments, observational sampling, and airborne remote sensing surveys. The collected data are sent to headquarters, where they are processed and published as data products. Available NEON data; supporting metadata; and science design, data collection, and data processing documentation are accessible through the NEON Data Portal.
NEON's Data Processing Levels
Data products are processed at progressive levels. Level 1 and higher data are served on the NEON Data Portal and programmatically via the NEON Data API. Level 0 data may be obtained by request.
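Programmatic access follows a predictable URL layout. As a minimal sketch, the snippet below builds NEON Data API v0 endpoint URLs; the product code `DP1.00002.001` (single-aspirated air temperature) and site code `HARV` are examples, and the endpoint layout is assumed to match the public API documentation:

```python
# Sketch of constructing NEON Data API v0 endpoint URLs (assumption: the
# /products and /data endpoint layout documented for the public API; the
# product and site codes below are examples).
BASE = "https://data.neonscience.org/api/v0"

def product_url(product_code: str) -> str:
    """Endpoint describing a data product, e.g. DP1.00002.001 (air temperature)."""
    return f"{BASE}/products/{product_code}"

def data_url(product_code: str, site_code: str, year_month: str) -> str:
    """Endpoint listing downloadable files for one product, site, and month."""
    return f"{BASE}/data/{product_code}/{site_code}/{year_month}"

example = data_url("DP1.00002.001", "HARV", "2022-04")
```

These URLs can then be fetched with any HTTP client; the JSON responses include download links for the packaged data files.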
Data Level | Description |
---|---|
Level 0 (L0) | Raw sensor readings or human-made observations obtained in the field, e.g. the 1 Hz resistance reading of a platinum resistance thermometer, or the identification of individual plant species along a transect. |
Level 1 (L1) | Raw measurements are quality controlled and converted to relevant scientific units (e.g. Ohms to degrees Celsius). Measurements are often averaged to longer temporal or spatial scales and accompanied with aggregation statistics. |
Level 2 (L2) | Temporally interpolated measurements or AOP data provided by flightline. |
Level 3 (L3) | Spatially interpolated or mosaicked measurements, e.g. 1 km tiles of NDVI measurements. |
Level 4 (L4) | The combination of basic measurements and scientific theory to derive higher order quantities. Examples include the computation of stream discharge from surface water elevation and a stage-discharge rating curve, and the exchange of carbon dioxide between the surface and the atmosphere from high frequency wind and gas concentration measurements. |
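Two rows of the table above can be made concrete with a short sketch: the Level 1 conversion of a platinum resistance thermometer reading from ohms to degrees Celsius, and a Level 4-style derived quantity from a stage-discharge rating curve. The Callendar-Van Dusen coefficients are the standard PT100 values; the rating-curve parameters are invented for illustration and are not NEON's:

```python
import math

# L1-style unit conversion and L4-style derived quantity, as sketches.
# PT100 Callendar-Van Dusen coefficients (standard values, valid for T >= 0 C):
R0 = 100.0        # resistance at 0 degrees C, in ohms
A = 3.9083e-3     # 1/degC
B = -5.775e-7     # 1/degC^2

def ohms_to_celsius(resistance: float) -> float:
    """Invert R = R0 * (1 + A*T + B*T^2) to convert ohms to degrees Celsius."""
    return (-A + math.sqrt(A * A - 4 * B * (1 - resistance / R0))) / (2 * B)

def discharge(stage_m: float, h0: float = 0.2, c: float = 4.5, b: float = 1.6) -> float:
    """Q = c * (h - h0)^b from a stage-discharge rating curve.
    h0, c, and b are hypothetical fitted parameters, not NEON values."""
    return c * max(stage_m - h0, 0.0) ** b
```

For example, `ohms_to_celsius(138.5055)` returns approximately 100 degrees C, the textbook PT100 resistance at that temperature.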
The Observation Systems (OS) Data Pipeline: From Data Collection to Data Product
NEON processes raw field observations to produce Observation Systems (OS) data products of scientific interest to the community. Documentation of the details for each data product consists of:
- Protocols: The protocols used by field scientists to carry out sampling and measurements.
- Data Product User Guide: A brief summary of the sampling design and the structure of the published data.
- Variables file: Provided with every OS data download, contains variable definitions and units.
- Validation file: Provided with every OS data download, contains validation rules applied to each variable during data entry and ingest.
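The validation file describes rules of the kind sketched below: required fields, controlled vocabularies, and numeric ranges checked at entry and ingest. The rule names and fields here are hypothetical illustrations, not NEON's actual schema:

```python
# Hedged sketch of ingest-style validation rules (field names, rule keys, and
# limits are invented for illustration; see the validation file in any OS
# download for the real rules).
RULES = {
    "siteID":   {"required": True, "allowed": {"HARV", "SCBI", "OKSR"}},
    "soilTemp": {"required": False, "min": -40.0, "max": 60.0},
}

def validate(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    for field, rule in RULES.items():
        value = record.get(field)
        if value is None:
            if rule.get("required"):
                errors.append(f"{field}: missing required value")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{field}: {value!r} not in controlled list")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: {value} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: {value} above maximum {rule['max']}")
    return errors
```

A record either passes cleanly or is returned to field staff with a list of specific failures, which is what makes correction at the point of entry practical.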
Controlling Data Entry Quality in the Field
The processing pipeline for NEON OS data emphasizes quality control at the point of data entry. Field scientists enter data through mobile applications built on a shared platform, Fulcrum, but customized for each protocol, allowing workflows and data validation suited to each protocol; for example, a dedicated application is used to enter distributed soil sampling data.
Data are submitted to the database from the mobile applications, on a regular schedule that allows time for post-collection QA/QC by field staff, and from spreadsheets generated by external analytical facilities. Before data can be ingested into the database, they must pass a validation step customized to each dataset. For further details on quality assurance and quality control at the points of data entry and ingest, see the Data Quality page.
How the Data are Processed
On ingest to the database, OS data are considered Level 0. On a regular schedule, newly ingested data are processed to Level 1. In the OS system, processing at this stage is light, consisting of associating spatial data and sample information, and adding higher-level taxonomic data. In addition, taxonomic data are screened at this step to determine if they contain any sensitive, threatened, endangered, or otherwise listed taxa. If so, the taxonomic identifications are “fuzzed”, meaning that the level of precision is reduced to the lowest non-threatened level. For example, if a species and all other representatives of its genus in a particular location are threatened, NEON data are published at the family level. At the sites within Great Smoky Mountains National Park (GRSM and LECO), listed taxa are redacted from publication entirely.
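The fuzzing step described above can be sketched as walking up the taxonomic ranks until reaching one that is not itself listed. The ranks, the listed-taxon check, and the taxon names below are illustrative assumptions, not NEON's production logic:

```python
# Sketch of taxonomic "fuzzing": publish an identification at the lowest rank
# whose name is not listed as sensitive or threatened. The is_listed callable
# and the example taxa are hypothetical.
RANKS = ["species", "genus", "family", "order"]

def fuzz(identification: dict, is_listed) -> dict:
    """Walk up the ranks; return the identification truncated at the first
    rank whose name is not listed, or an empty dict (full redaction)."""
    for i, rank in enumerate(RANKS):
        name = identification.get(rank)
        if name is not None and not is_listed(name):
            # Publish at this rank and coarser; drop all finer-grained ranks.
            return {r: identification[r] for r in RANKS[i:] if r in identification}
    return {}  # no publishable rank found: redact

listed = {"Picea rara", "Picea"}.__contains__   # invented listed taxa
obs = {"species": "Picea rara", "genus": "Picea", "family": "Pinaceae"}
```

In this invented case the species and its whole genus are listed, so the observation would be published at the family level, mirroring the example in the text.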
Some data repositories use a different approach to protection of threatened taxa, by fuzzing the locations at which the taxa were found. This approach is not possible for NEON data due to the highly controlled spatial design required for integration across data products and scaling up.
After processing to Level 1, data are packaged by site and by month and published to the NEON data portal.
Re-processing and Improving Data Quality
Despite NEON’s controlled data entry, errors are sometimes found in published data; for example, an analytical lab may adjust its calibration curve and re-calculate past analyses, or field scientists may discover a past misidentification. In these cases, the Level 0 data are edited, re-processed to Level 1, and re-published. Published data files include a time stamp in the file name; a new time stamp indicates the data have been re-published and may differ from previously published data.
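A re-publication can therefore be detected by comparing the generation time stamps embedded in the file names. The sketch below extracts the `YYYYMMDDTHHMMSSZ`-style token; the file names shown are hypothetical examples of the NEON naming convention:

```python
from datetime import datetime, timezone

# Sketch: detect re-publication by comparing the generation time stamp tokens
# in two NEON-style file names (file names here are hypothetical examples).
def generation_time(filename: str) -> datetime:
    """Extract the YYYYMMDDTHHMMSSZ time stamp token from a NEON file name."""
    for token in filename.split("."):
        try:
            return datetime.strptime(token, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"no time stamp token in {filename!r}")

old = "NEON.D01.HARV.DP1.10003.001.brd_countdata.2019-06.basic.20201222T013945Z.csv"
new = "NEON.D01.HARV.DP1.10003.001.brd_countdata.2019-06.basic.20211103T080106Z.csv"
```

If `generation_time(new)` is later than `generation_time(old)`, the file has been re-published and the two downloads may differ.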
Data are subject to re-processing at any time during an initial provisional period; data releases are never re-processed. For more details, see the Data Releases section below.
The Instrument Systems (IS) Data Pipeline: From Sensor Measurement to Data Product
NEON collects and processes raw sensor measurements to produce Instrument Systems (IS) data products of scientific interest to the community. Documentation of the details for each IS data product may consist of:
- Algorithm Theoretical Basis Document (ATBD): A full explanation of the algorithms used to process data. Each ATBD details the scientific theory behind the measurement, relevant processing algorithms, as well as the steps taken to determine uncertainty and to perform quality control/quality assurance.
- Sensor Command, Control, and Configuration (C3) Document: Specifies the command, control, and configuration details for operating the relevant sensor and its assembly. It includes a detailed discussion of all necessary requirements for operational control parameters, conditions/constraints, set points, and any necessary error handling.
- Variables file: Provided with every IS data download except from the bundled eddy covariance product; contains variable definitions and units.
- Sensor position file: Provided with every IS data download except from the bundled eddy covariance product; contains the positions of the sensors relative to a reference location, as well as the reference location coordinates.
Maintaining High Quality Automated Instruments in the Field
An array of sensors at each field site collects raw, Level 0 data 24 hours a day, 7 days a week. As part of the sensor-to-network interface, a datalogger receives measurement data from the sensors and transmits the data to an on-site server. For most sites, these raw Level 0 data are automatically sent back to NEON Headquarters in close to real time and stored in the database. The only exception is the aquatic site OKSR, which lacks streaming connectivity; its data are stored on-site and retrieved at regular intervals. To ensure high quality data with low error rates, NEON engineers and field staff use systems that monitor each sensor’s state of health. These systems can generate a trouble ticket if the instrument’s measurements or operating parameters fall outside a predefined range, prompting a NEON technician, engineer, or scientist to investigate the issue.
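The state-of-health check described above amounts to flagging readings that leave a predefined operating range. A minimal sketch, with invented thresholds and readings:

```python
# Minimal sketch of a state-of-health range check that could prompt a trouble
# ticket; the thresholds and supply-voltage samples are invented illustrations.
def out_of_range(readings, low, high):
    """Return the indices of readings outside the [low, high] operating range."""
    return [i for i, v in enumerate(readings) if not (low <= v <= high)]

voltages = [12.1, 12.0, 3.2, 12.2, 18.9]   # hypothetical supply-voltage samples
alerts = out_of_range(voltages, low=10.0, high=15.0)
```

A non-empty `alerts` list would be the trigger for opening a ticket against that sensor.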
Processing Automated Instrument Data
The database that stores raw Level 0 data also contains the calibration coefficients and associated metadata required to process data into higher-level data products. Processing begins by applying the calibration for the specific instrument, as determined the last time the sensor was calibrated in the NEON Calibration, Validation and Audit Laboratory. Automated quality control algorithms check for plausible magnitude and variability, in addition to other sensor-specific tests, before (and sometimes after) applying the relevant scientific algorithms for the data product. See the Data Quality page for more information. Mean values are then calculated for the appropriate averaging interval, which is one minute and thirty minutes for most NEON instrument data. Algorithms to generate higher-level data products include interpolation over space and time and calculations based on multiple Level 1 data products. These Level 1 and higher data products are also stored in the database.
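The averaging step can be sketched as collapsing calibrated one-minute values into thirty-minute means with a count as a simple aggregation statistic. The window lengths mirror the text; the readings are invented:

```python
from statistics import mean

# Sketch of the averaging step: collapse calibrated 1-minute values into
# 30-minute means. The window lengths follow the text; the data are invented.
def aggregate(values, window=30):
    """Yield (mean, n) for each consecutive window of calibrated values."""
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        yield mean(chunk), len(chunk)

one_minute = [20.0 + 0.01 * i for i in range(60)]   # one hour of fake readings
half_hour = list(aggregate(one_minute))
```

NEON's actual products also carry variance, uncertainty, and quality-flag statistics alongside each mean; the count here stands in for that richer set.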
After processing to Level 1 or higher, data are formatted, compressed, tagged with the publication date, and delivered to the NEON Data Portal.
Why Re-processing of Data May Occur
Despite rigorous algorithm development and quality control procedures used prior to and during data processing, errors and omissions are sometimes discovered after data have been published to the Data Portal. For example, processing may have previously failed for a particular time period, prompting reprocessing for a more complete dataset. Alternatively, field scientists may discover that a sensor was obstructed from measuring its intended target (e.g., sensor covered in dust), prompting investigation and manual flagging of the data post-publication. If the data are reprocessed for any reason, the publication date is updated and the change-log included in the download package is annotated with the reason. Data are subject to re-processing at any time during the provisional period; data releases are never re-processed. For more details, see the Data Releases section below.
The Airborne Observation Platform (AOP) Data Pipeline
NEON collects and processes raw airborne sensor measurements to produce Airborne Observation Platform (AOP) data products of scientific interest to the community. Documentation of the details for each AOP data product may consist of:
- Algorithm Theoretical Basis Document (ATBD): A full explanation of the algorithms used to process data. Each ATBD details relevant data processing algorithms, as well as the steps taken to determine uncertainty and to perform quality control/quality assurance.
- Data processing quality assurance (QA) documents: A summary of the data quality metrics used to assess the validity of the AOP data products, along with information on flight acquisition and processing parameters. These documents are delivered with the data products to which they apply.
Getting AOP Data From the Plane to the Processing Team
AOP data are collected by four primary sensors mounted on an airborne platform: 1) a NEON imaging spectrometer (NIS), 2) a full-waveform Light Detection and Ranging (LiDAR) instrument, 3) a high-resolution Red-Green-Blue (RGB) camera, and 4) a high-accuracy GPS/IMU (inertial measurement unit). In addition to these primary sensors, the AOP carries support monitoring equipment that provides real-time health metrics for the sensors and metadata used in downstream processing. The aircraft contains its own internal network that allows airborne sensor operators to control the sensors from centralized computers and facilitates the transfer of raw measurements to specialized high-throughput flight disks. Each flight yields extremely high volumes of data. When a flight is completed, the raw data (referred to as Level 0 data) are transferred manually from the aircraft to the ‘Hotel Kit’, a NEON custom-built computer that accepts the flight disks and extracts the data to an internal drive, where the data are quality checked and duplicated to a removable hard drive. The removable hard drive is shipped to NEON’s data center in Denver, CO, where it is ingested to the Google Cloud Storage (GCS) archive. Checksums are computed on each copy of the data throughout the process to ensure the data have not been corrupted. Upon successful ingestion to GCS, AOP’s processing team is automatically notified that the data have arrived and processing can begin.
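The checksum step amounts to computing a digest for each copy of the data and confirming the digests match. A minimal sketch, using MD5 as an assumption since the text does not name NEON's algorithm, with invented byte strings standing in for flight files:

```python
import hashlib

# Sketch of checksum verification across copies of flight data. MD5 is an
# assumed choice of algorithm; the byte strings are invented stand-ins for
# the actual flight disk contents.
def digest(data: bytes) -> str:
    """Return a hex digest identifying the exact byte content of a copy."""
    return hashlib.md5(data).hexdigest()

original = b"raw lidar waveform bytes"
copy = b"raw lidar waveform bytes"
corrupted = b"raw lidar waveform byte5"   # one flipped character
```

Matching digests confirm a copy is bit-identical to the original; any corruption during extraction, duplication, or shipping produces a mismatch.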
What Happens During AOP Data Processing
AOP processing is executed on a series of high-power workstations that facilitate efficient processing of the high data volumes typical of AOP collections. Processing begins with AOP scientists downloading the raw data from GCS to the workstations' local drives. Both the airborne trajectory and the discrete and waveform lidar are initially processed in commercial off-the-shelf (COTS) software due to the proprietary nature of the raw data formats. After production of the airborne trajectory and initial lidar products, separate processing pipelines for each sensor begin, producing: 1) the higher-level lidar products, 2) the higher-level spectrometer products, and 3) the L1 waveform lidar data product. Depending on the volume of data, these pipelines range in execution time from hours to several days. Throughout processing, several QA metrics and reports are produced and assessed by AOP scientists to ensure the validity of the data products. Once an AOP scientist has confirmed that the data products are complete and have passed QA checks, the products are transferred to GCS and linked to the Data Portal.
Why Re-processing May Occur
The algorithms used to produce the AOP data products are continually assessed to improve their accuracy and quality assessments. When a significant algorithmic upgrade is realized, the AOP data are reprocessed to ensure compatibility across all years of collected data. For example, improvements to the algorithms associated with the calibration of the NIS required reprocessing spectrometer-derived data products collected between 2013 and 2016; these data were updated in early 2020. Similarly, for the 2024 Data Release, the Canopy Height Models were reprocessed with an improved implementation of the pit-free CHM algorithm. Reprocessing may also occur if a quality issue is shown to have affected the data; in this case, the data are reprocessed immediately to apply the appropriate fix, and associated messaging is posted on the Data Portal.