Tutorial
Exploring diel carbon flux cycles
Authors: [David Durden]
Last Updated: Apr 20, 2023
This data tutorial provides an overview of exploring NEON carbon flux data using the neonUtilities R package. If you are just starting to work with NEON data, it is recommended that you also check out the more general tutorials, such as the neonUtilities tutorial, or the Introduction to working with NEON eddy flux data tutorial for a more flux-centric introduction. This tutorial builds on the data exploration techniques described in those tutorials and explores carbon dynamics at two forested ecosystem sites with very different land management practices in NEON's Domain 5 - Great Lakes: STEI and TREE. The STEI and TREE sites are located only approximately 1 mile from each other, so the observed differences in fluxes can be attributed to the land management practices associated with the forests.
This tutorial assumes general familiarity with eddy-covariance data, and will briefly discuss both turbulent (turb) and storage (stor) fluxes in relation to the net surface-atmosphere exchange (nsae). The nsae for CO2 fluxes is often referred to as Net Ecosystem Exchange (NEE).
1. Setup
Start by installing and loading packages and setting options. The function below will evaluate whether the required packages (packReq) already exist in your R library, and will install and load any that are missing. The rhdf5 package is hosted on Bioconductor, so the BiocManager package is needed to install rhdf5 from the Bioconductor repository:
#Required R packages
packReq <- c("BiocManager", "rhdf5", 'neonUtilities', 'ggplot2', 'tidyverse', "lubridate")

#Install and load all required packages
lapply(packReq, function(x) {
  if(require(x, character.only = TRUE) == FALSE) {
    #rhdf5 is hosted on Bioconductor, so install it via BiocManager
    if(x == "rhdf5") {
      BiocManager::install(x)
    } else {
      install.packages(x)
    }
    library(x, character.only = TRUE)
  }})
options(stringsAsFactors=F)
Now that all the required packages are loaded, let's define the function arguments needed to download and stack our flux data using the neonUtilities package's zipsByProduct() and stackEddy() functions. I will define the input arguments as variables to make it easy to modify them if I want to explore additional sites or months of data. For this analysis, we will focus on the 2022 growing season at STEI and TREE. Please be aware that the data download size is ~1 GB. The code can be modified to shorten or extend this window, but to get a good look at the carbon dynamics at the sites it was important to capture most of the growing season.
Inputs to the zipsByProduct() function:

- dpID: DP4.00200.001, the bundled eddy covariance product
- package: basic (the expanded package has additional quality metrics and advanced footprint matrices that are not relevant to this tutorial)
- site: STEI and TREE
- startdate: 2022-04
- enddate: 2022-09
- savepath: modify this to something logical on your machine, or use R's tempdir()
- check.size: TRUE if you want to see file size before downloading, otherwise FALSE
The download may take a while, especially if you're on a slow network. For faster downloads, consider using an API token.
#Target dates
startDate <- "2022-04"
endDate <- "2022-09"
#Site
site <- c("STEI", "TREE")
#File directory: element [1] is a temporary directory and element [2] is the
#author's local path; change the index or path to suit your machine
dirFile <- c(tempdir(), "/home/ddurden/eddy/tmp/tutorial")[2]

zipsByProduct(dpID="DP4.00200.001", package="basic",
              site=site,
              startdate=startDate, enddate=endDate,
              savepath=dirFile,
              check.size=FALSE)
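If you have a NEON API token, zipsByProduct() accepts it via the token argument. A minimal sketch, assuming your token is stored in an environment variable named NEON_TOKEN (the variable name here is just an example):

#Same download call, but authenticated with a NEON API token
zipsByProduct(dpID="DP4.00200.001", package="basic",
              site=site,
              startdate=startDate, enddate=endDate,
              savepath=dirFile,
              check.size=FALSE,
              token=Sys.getenv("NEON_TOKEN"))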
2. Stacking Level 4 Flux Data
There are five levels of data contained in the eddy flux bundle. For full details, refer to the NEON algorithm document. In this tutorial we will focus only on the Level 4 (dp04) flux data products; however, additional data products, from Level 0 prime (dp0p) calibrated raw data to Level 3 (dp03) spatially interpolated vertical profiles used to derive the storage flux, are available in the EC bundled HDF5 files. More information can be found on the NEON algorithm document webpage. The Introduction to working with NEON eddy flux data tutorial dives into additional detail regarding the other data product levels, but this tutorial will focus exclusively on flux data.
To extract the dp04 data from the HDF5 files and merge them into a single table, we'll use the stackEddy() function. We provide the function two input arguments, filepath and level. The filepath will be the file directory (dirFile) used for the data download via zipsByProduct() combined with the filesToStack00200 folder created by that function. To grab just the flux data products, set level = "dp04":
flux <- neonUtilities::stackEddy(filepath=paste0(dirFile,"/filesToStack00200"),
                                 level="dp04")
We now have an object called flux. It's a named list containing four tables: one table for each site's data, plus the variables and objDesc tables. One of the advantages of the HDF5 files is that all the metadata and data descriptions can be compiled into the file as additional data tables or as attributes. The stackEddy() function stacks this contextual data into the variables and objDesc tables to help you interpret the column headers in the data tables (though it's not comprehensive currently).
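For example, we can use the objDesc table to look up the meaning of the terms that make up a column name. A minimal sketch, assuming the objDesc table contains Object and Description columns as in the introductory flux tutorial:

#Split a column name into its component terms and look them up in objDesc
term <- unlist(strsplit("qfqm.fluxCo2.turb.qfFinl", split=".", fixed=TRUE))
flux$objDesc[which(flux$objDesc$Object %in% term), ]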
The terms of interest for our analysis in dp04 are the three flux quantities under fluxCo2:

- turb: Turbulent flux
- stor: Storage flux
- nsae: Net surface-atmosphere exchange (the sum of turb and stor; see the quick check after this list)
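Per the NEON algorithm documentation, nsae is simply the sum of the turbulent and storage terms. A quick sanity check on one site's table (a sketch; differences should be near zero, apart from NAs where a term is missing):

#nsae should equal turb + stor for each half hour
summary(flux$TREE$data.fluxCo2.nsae.flux -
          (flux$TREE$data.fluxCo2.turb.flux + flux$TREE$data.fluxCo2.stor.flux))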
Note the variables
table contains the units for each field:
flux$variables
## category system variable stat units
## 1 data fluxCo2 nsae timeBgn NA
## 2 data fluxCo2 nsae timeEnd NA
## 3 data fluxCo2 nsae flux umolCo2 m-2 s-1
## 4 data fluxCo2 stor timeBgn NA
## 5 data fluxCo2 stor timeEnd NA
## 6 data fluxCo2 stor flux umolCo2 m-2 s-1
## 7 data fluxCo2 turb timeBgn NA
## 8 data fluxCo2 turb timeEnd NA
## 9 data fluxCo2 turb flux umolCo2 m-2 s-1
## 10 data fluxH2o nsae timeBgn NA
## 11 data fluxH2o nsae timeEnd NA
## 12 data fluxH2o nsae flux W m-2
## 13 data fluxH2o stor timeBgn NA
## 14 data fluxH2o stor timeEnd NA
## 15 data fluxH2o stor flux W m-2
## 16 data fluxH2o turb timeBgn NA
## 17 data fluxH2o turb timeEnd NA
## 18 data fluxH2o turb flux W m-2
## 19 data fluxMome turb timeBgn NA
## 20 data fluxMome turb timeEnd NA
## 21 data fluxMome turb veloFric m s-1
## 22 data fluxTemp nsae timeBgn NA
## 23 data fluxTemp nsae timeEnd NA
## 24 data fluxTemp nsae flux W m-2
## 25 data fluxTemp stor timeBgn NA
## 26 data fluxTemp stor timeEnd NA
## 27 data fluxTemp stor flux W m-2
## 28 data fluxTemp turb timeBgn NA
## 29 data fluxTemp turb timeEnd NA
## 30 data fluxTemp turb flux W m-2
## 31 data foot stat timeBgn NA
## 32 data foot stat timeEnd NA
## 33 data foot stat angZaxsErth deg
## 34 data foot stat distReso m
## 35 data foot stat veloYaxsHorSd m s-1
## 36 data foot stat veloZaxsHorSd m s-1
## 37 data foot stat veloFric m s-1
## 38 data foot stat distZaxsMeasDisp m
## 39 data foot stat distZaxsRgh m
## 40 data foot stat distZaxsAbl m
## 41 data foot stat distXaxs90 m
## 42 data foot stat distXaxsMax m
## 43 data foot stat distYaxs90 m
## 44 qfqm fluxCo2 nsae timeBgn NA
## 45 qfqm fluxCo2 nsae timeEnd NA
## 46 qfqm fluxCo2 nsae qfFinl NA
## 47 qfqm fluxCo2 stor qfFinl NA
## 48 qfqm fluxCo2 stor timeBgn NA
## 49 qfqm fluxCo2 stor timeEnd NA
## 50 qfqm fluxCo2 turb timeBgn NA
## 51 qfqm fluxCo2 turb timeEnd NA
## 52 qfqm fluxCo2 turb qfFinl NA
## 53 qfqm fluxH2o nsae timeBgn NA
## 54 qfqm fluxH2o nsae timeEnd NA
## 55 qfqm fluxH2o nsae qfFinl NA
## 56 qfqm fluxH2o stor qfFinl NA
## 57 qfqm fluxH2o stor timeBgn NA
## 58 qfqm fluxH2o stor timeEnd NA
## 59 qfqm fluxH2o turb timeBgn NA
## 60 qfqm fluxH2o turb timeEnd NA
## 61 qfqm fluxH2o turb qfFinl NA
## 62 qfqm fluxMome turb timeBgn NA
## 63 qfqm fluxMome turb timeEnd NA
## 64 qfqm fluxMome turb qfFinl NA
## 65 qfqm fluxTemp nsae timeBgn NA
## 66 qfqm fluxTemp nsae timeEnd NA
## 67 qfqm fluxTemp nsae qfFinl NA
## 68 qfqm fluxTemp stor qfFinl NA
## 69 qfqm fluxTemp stor timeBgn NA
## 70 qfqm fluxTemp stor timeEnd NA
## 71 qfqm fluxTemp turb timeBgn NA
## 72 qfqm fluxTemp turb timeEnd NA
## 73 qfqm fluxTemp turb qfFinl NA
## 74 qfqm foot turb timeBgn NA
## 75 qfqm foot turb timeEnd NA
## 76 qfqm foot turb qfFinl NA
3. Plotting CO2 Flux Data
Let's look at some data! First, we will combine the data from the two sites into a single dataframe with an additional factor variable, Site, which lets us use the tidyverse and ggplot2 packages for some data munging and to create some interesting data exploration visualizations. In this tutorial we use the piping operator %>% from the dplyr package to quickly subset, filter, sort, and plot data using just our combined dataframe dfFlux. We start by looking at the turbulent CO2 flux time series at both sites, with the points colored by qfFinl:
flux$STEI$Site <- "STEI"
flux$TREE$Site <- "TREE"
dfFlux <- rbind.data.frame(flux$STEI, flux$TREE)
dfFlux$Site <- as.factor(dfFlux$Site)
dfFlux %>%
  ggplot(aes(timeBgn, data.fluxCo2.turb.flux, color=factor(qfqm.fluxCo2.turb.qfFinl))) +
  geom_point() +
  scale_color_brewer(palette="Set2", name="qfFinal") +
  facet_grid(~Site)
When we plot the turbulent CO2 flux data, we see good agreement between the STEI and TREE sites. We also see some outliers that look to be due to erroneous data, especially at the STEI site. Fortunately, it looks like the final quality flag (qfFinal, stored as qfFinl) caught the largest outliers, and we can use it to remove these data from the analysis. Let's first dig into a little background information about the NEON bundled EC quality flags.
Quality flags
The NEON surface-atmosphere exchange (SAE) data products all come accompanied by a qfFinal. Information on the defining quality flags (qf), quality metrics (qm), and the framework used to derive qfFinal is detailed in the NEON EC bundled data products algorithm document and the NEON Quality Flags and Quality Metrics for TIS Data Products algorithm document. Ultimately, for the dp04 flux data products (e.g., data.fluxCo2.turb.flux), the qfFinal flags of the dp01 constituent data streams (for fluxCo2 these are rtioMoleDryCo2 and veloZaxsErth; dp01 data are outside the scope of this tutorial), derived from plausibility tests (i.e., range, step, spike, and persistence tests) and sensor diagnostic flags, are propagated and combined with theoretical eddy-covariance quality tests (i.e., stationarity and integral turbulence characteristics tests) to determine whether the data product is flagged as valid (qfFinal = 0) or invalid (qfFinal = 1).
Let's summarize our qfqm.fluxCo2.turb.qfFinl to see the total flagged percentage at our two sites.
dfFlux %>% group_by(Site) %>% summarise(mean(qfqm.fluxCo2.turb.qfFinl))
## # A tibble: 2 x 2
## Site `mean(qfqm.fluxCo2.turb.qfFinl)`
## <fct> <dbl>
## 1 STEI 0.219
## 2 TREE 0.261
Now, let's plot the qfFinal failed percentage for all our flux data products:
dfFlux %>%
  select(contains("qfqm")) %>%
  pivot_longer(cols = everything(), names_to = "var") %>%
  group_by(var) %>%
  summarise(mean_var = mean(value) * 100) %>%
  ggplot(aes(x = var, y = mean_var, fill = var)) +
  geom_col() +
  guides(x = guide_axis(angle = 90)) +
  labs(x="Variable", y="Percent qfFinal failed") +
  scale_fill_brewer(palette="RdYlBu")
Removing data with failed qfFinal
We see that qfqm.fluxCo2.turb.qfFinl is flagged ~20-25% of the time over the growing season; however, qfqm.fluxCo2.stor.qfFinl and qfqm.fluxCo2.nsae.qfFinl are flagged quite a bit more due to the complexity of the system and the propagation of issues from sub-components to the overall system (additionally, we currently employ a conservative quality control for the storage fluxes). Now that we have evaluated qfFinal for all the fluxCo2 data streams, let's remove flagged data from our dataframe.
dfFlux %>% select(contains("qfqm") & contains("fluxCO2")) %>% summarise_each(sum)
## qfqm.fluxCo2.nsae.qfFinl qfqm.fluxCo2.stor.qfFinl qfqm.fluxCo2.turb.qfFinl
## 1 9322 7886 3526
dfFlux %>% select(contains("data") & contains("fluxCO2")) %>% summarise_each(funs(sum(is.na(.))))
## data.fluxCo2.nsae.flux data.fluxCo2.stor.flux data.fluxCo2.turb.flux
## 1 3174 2334 1366
dfFlux$data.fluxCo2.turb.flux[(which(dfFlux$qfqm.fluxCo2.turb.qfFinl== 1))] <- NaN
dfFlux$data.fluxCo2.stor.flux[(which(dfFlux$qfqm.fluxCo2.stor.qfFinl== 1))] <- NaN
dfFlux$data.fluxCo2.nsae.flux[(which(dfFlux$qfqm.fluxCo2.nsae.qfFinl== 1))] <- NaN
dfFlux %>% select(contains("qfqm") & contains("fluxCO2")) %>% summarise_each(sum)
## qfqm.fluxCo2.nsae.qfFinl qfqm.fluxCo2.stor.qfFinl qfqm.fluxCo2.turb.qfFinl
## 1 9322 7886 3526
dfFlux %>% select(contains("data") & contains("fluxCO2")) %>% summarise_each(funs(sum(is.na(.))))
## data.fluxCo2.nsae.flux data.fluxCo2.stor.flux data.fluxCo2.turb.flux
## 1 9713 7913 4343
We see from the summary of the fluxCo2 qfFinal and data NA counts that we have removed all the flagged data from data.fluxCo2.turb.flux, data.fluxCo2.stor.flux, and data.fluxCo2.nsae.flux. Now we have a clean data set to begin our diel flux cycle data analysis.
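As a quick sanity check (a sketch), we can confirm that every half hour flagged as invalid now holds an NA flux value:

#Should return TRUE if all flagged turbulent CO2 fluxes were set to NaN
all(is.na(dfFlux$data.fluxCo2.turb.flux[dfFlux$qfqm.fluxCo2.turb.qfFinl == 1]))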
4. Plotting the diel cycle of CO2 fluxes
Let's look at the data in a different way: the diel cycles of carbon and water vapor fluxes are important when evaluating carbon sequestration and ecosystem health. We can evaluate the diel cycle by binning the data by hour of the day, which is relatively straightforward using ggplot2 and the boxplot geom (geom_boxplot()).
dfFlux$hour <- factor(lubridate::hour(dfFlux$timeBgn))

ggplot(dfFlux, aes(x = hour, y = data.fluxCo2.turb.flux, fill = Site)) +
  geom_boxplot() +
  stat_summary(fun = median, geom = 'line', aes(group = Site, colour = Site))
Nice! We get a glimpse of daily carbon dynamics, but something doesn't quite look right. Note the timing of C uptake; the UTC time zone is apparent here, with uptake occurring at times that appear to be during the night. All NEON data use UTC time, aka Greenwich Mean Time, and this is true across NEON's instrumented, observational, and airborne measurements. When working with NEON data, it's best to keep everything in UTC as much as possible, otherwise it's very easy to end up with data in mismatched times. For plotting a diel cycle, though, we may want to convert the time to local standard time (LST). The metadata attributes at the site level in the bundled EC HDF5 files contain fields for ZoneTime and TimeDiffUtcLt. We can use the rhdf5 package's h5readAttributes() function to grab the metadata. In this analysis, both STEI and TREE are in the same time zone, so we will only grab metadata from one file and apply it to both sites.
fileMeta <- list.files(dirFile, pattern = paste0(".*",site[1],".*.h5"), recursive = TRUE, full.names = TRUE)[1]
siteMeta <- h5readAttributes(fileMeta, name = site[1])
siteMeta
## $DistZaxsCnpy
## [1] 9.5
##
## $DistZaxsDisp
## [1] "6.2"
##
## $DistZaxsGrndOfst
## [1] "0.758"
##
## $DistZaxsLvlMeasTow
## [1] "0.65" "4.79" "8.15" "12.30" "15.66" "22.36"
##
## $DistZaxsTow
## [1] "21.34"
##
## $ElevRefeTow
## [1] "475.7317"
##
## $LatTow
## [1] 45.50895
##
## $LonTow
## [1] -89.58636
##
## $LvlMeasTow
## [1] 6
##
## $`Pf$AngEnuXaxs`
## [1] 0.01544
##
## $`Pf$AngEnuYaxs`
## [1] -0.019366
##
## $`Pf$Ofst`
## [1] 0.05764
##
## $TimeDiffUtcLt
## [1] -6
##
## $TimeTube
## [1] 10.1 11.1 10.4 11.3 11.3 12.2
##
## $TypeEco
## [1] "Aspen dominated regenerating forest"
##
## $ZoneTime
## [1] "CST"
##
## $ZoneUtm
## [1] "16"
dfFlux$timeBgnLst <- dfFlux$timeBgn + lubridate::hours(siteMeta$TimeDiffUtcLt)
dfFlux$hourLst <- factor(lubridate::hour(dfFlux$timeBgnLst))

ggplot(dfFlux, aes(x = hourLst, y = data.fluxCo2.turb.flux, fill = Site)) +
  geom_boxplot() +
  stat_summary(fun = median, geom = 'line', aes(group = Site, colour = Site)) +
  scale_y_continuous(limits = quantile(dfFlux$data.fluxCo2.turb.flux, c(0.001, 0.999), na.rm = TRUE))
That looks better; now we can clearly see the diel cycle of the turbulent CO2 flux at our STEI and TREE sites, with peak carbon uptake around noon. The carbon uptake should align with peak solar angle and incoming photosynthetically active radiation (PAR). The boxplot also provides information about the variability of the fluxes throughout the day across the growing season, with the interquartile range (IQR) represented by the box and the whiskers indicating the threshold for outliers (black dots) as 1.5 * IQR subtracted from the 1st quartile (Q1) and added to the 3rd quartile (Q3):

Q1 - 1.5 * IQR
Q3 + 1.5 * IQR
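For reference, we can compute these whisker thresholds for the turbulent CO2 flux directly. A sketch using the default 1.5 * IQR rule that geom_boxplot() applies (note the boxplot computes them per hour and site, while this pools all half hours):

#Whisker thresholds for the turbulent CO2 flux, pooled across sites and hours
q <- quantile(dfFlux$data.fluxCo2.turb.flux, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]
c(lower = q[[1]] - 1.5 * iqr, upper = q[[2]] + 1.5 * iqr)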
In the boxplots above, the outlier points were greatly reduced after the qfFinal data removal; however, some outliers remained. To focus our attention on the bulk of the data, we scaled the y-axis using scale_y_continuous() in conjunction with the quantile() function to look at the central 99.8% of the data, excluding extreme outliers that may have remained after our filtering.
We know that the turbulent flux is usually the dominant flux component of NEE, but the storage flux can be significant at forested sites. Let's have a look at the storage flux and its impact on the NSAE flux at STEI and TREE:
ggplot(dfFlux, aes(x = hourLst, y = data.fluxCo2.stor.flux, fill = Site)) +
  geom_boxplot() +
  stat_summary(fun = median, geom = 'line', aes(group = Site, colour = Site)) +
  scale_y_continuous(limits = quantile(dfFlux$data.fluxCo2.stor.flux, c(0.001, 0.999), na.rm = TRUE))
The storage flux is, as expected, much smaller than the turbulent flux, and it is dominant during the nighttime hours and the evening/morning transition times when turbulence is generally weaker and the trees and understory vegetation aren't photosynthesizing. It's interesting to note the stronger signal at the TREE site. Let's look at how that impacts the overall NSAE flux:
ggplot(dfFlux, aes(x = hourLst, y = data.fluxCo2.nsae.flux, fill = Site)) +
  geom_boxplot() +
  stat_summary(fun = median, geom = 'line', aes(group = Site, colour = Site)) +
  scale_y_continuous(limits = quantile(dfFlux$data.fluxCo2.nsae.flux, c(0.001, 0.999), na.rm = TRUE))
When we look at the NSAE flux, we see that the STEI site appears to sequester more carbon, which makes sense considering the STEI site is predominantly a young aspen stand grown for pulpwood that is growing at a fast rate. It's difficult to get a good total carbon uptake estimate without performing some more advanced gap-filling and partitioning to better understand the ecosystem carbon dynamics, but we can simply compare the mean flux contributions over the period:
dfFlux %>%
  select(contains("data.fluxCo2") | "Site") %>%
  group_by(Site) %>%
  summarise_each(funs(mean(., na.rm = TRUE)))
## # A tibble: 2 x 4
## Site data.fluxCo2.nsae.flux data.fluxCo2.stor.flux data.fluxCo2.turb.flux
## <fct> <dbl> <dbl> <dbl>
## 1 STEI -2.72 -0.0259 -3.83
## 2 TREE -2.54 -0.0911 -3.14
As expected from the plots, we see that the STEI site takes up more carbon on average across the growing season when we look at the nsae and turb CO2 fluxes. This comes with the tradeoff that the trees are managed for cultivation in the future, whereas TREE is currently managed as an educational resource that is likely to be preserved, which has other ecosystem health benefits.
This analysis is meant to provide a glimpse into NEON flux data products, focusing primarily on CO2. This workflow can easily be adapted to look at sensible and latent heat (H2O) fluxes, or to evaluate additional NEON sites or years of data.
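For example, here is a minimal sketch of the same diel boxplot applied to the turbulent latent heat (H2O) flux (you would first want to apply the same qfFinl filtering to the fluxH2o data streams):

#Diel cycle of the turbulent H2O (latent heat) flux, in W m-2
ggplot(dfFlux, aes(x = hourLst, y = data.fluxH2o.turb.flux, fill = Site)) +
  geom_boxplot() +
  stat_summary(fun = median, geom = 'line', aes(group = Site, colour = Site)) +
  scale_y_continuous(limits = quantile(dfFlux$data.fluxH2o.turb.flux,
                                       c(0.001, 0.999), na.rm = TRUE))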