Tutorial

Access NEON Data for Metagenomics

Authors: Hugh Cross, Kaye Shek

Last Updated: Aug 2, 2023

The purpose of this tutorial is to introduce users to accessing NEON data for expanding metagenomic analyses. The tutorial is being run in conjunction with the workshop "FAIR Data and NEON Data discovery at the National Microbiome Data Collaborative", presented at the annual meeting of the Ecological Society of America. The purpose of this component is to provide a brief introduction to how to download NEON data, with a focus on those NEON data that can be used as metadata for the NMDC metagenomic analyses of NEON samples. We will provide some brief examples of how to download relevant NEON data for soil and aquatic samples, and then how to wrangle the data to link them to NEON metagenomic samples that have been run through the NMDC Edge pipeline.

Learning Objectives

After completing this tutorial you will be able to:

Access information on NEON data relevant to metagenomic research.
Download NEON data for soil and aquatic samples.
Process NEON sample data so that it can be analyzed alongside metagenomic data.

Things You’ll Need To Complete This Tutorial

R Programming Language

You will need a current version of R to complete this tutorial. We also recommend the RStudio IDE to work with R.

R Packages to Install

Before starting the tutorial, ensure that the following packages are installed.

neonUtilities: Basic functions for accessing NEON data
neonOS: Functions for common data wrangling needs for NEON observational data
tidyverse: A collection of R packages designed for data science
respirometry: A package that includes a function to average pH data

All of these packages can be installed from CRAN:

install.packages("neonUtilities")
install.packages("neonOS")
install.packages("tidyverse")
install.packages("respirometry")

More on Packages in R – Adapted from Software Carpentry.

Working Directory

This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.

An overview of setting the working directory in R can be found here.

Additional Resources

This tutorial provides some basic examples for finding information and downloading NEON data. For more details on downloading NEON data, the "Download and Explore NEON Data" tutorial provides an overview of downloading data using the Data Portal and the neonUtilities R package, and this tutorial goes into detail on the neonUtilities package and how each function works. Most of the examples provided here use the R programming language, and NEON provides a tutorial series on basic R skills to get you started.

The NEON Data Portal

NEON provides a wealth of data to assist with your research. The NEON Data Portal is the best place to start looking for the data you want. There is a lot of information on this page. For now, we will have a quick tour.

There are many ways to search the Data Portal, including searching by site. The page includes an interactive map.


Explore Data Products

From the Data Portal main page, click on the EXPLORE button in the box called "Explore Data Products", or just go to this link, and you can search by keywords or data product codes. For example, type 'metagenomics' in the Search field, and then scroll down through the results.

Click on the link to Soil microbe metagenome sequences and the information for this data product will open in a new page. This page has a lot of useful information, including a description of the tables included with the data, links to documents, a Quick Start guide, and much more. You can download data directly from this page, selecting specific sites and time ranges in an interactive table. In this tutorial, however, we are going to use the neonUtilities R package to access the data.

NEON metadata for metagenomic data

Downloading data with neonUtilities

The metagenomics data available on the NEON website include only the raw sequence files. The sample data available through the NMDC website have been processed through the NMDC Edge pipeline, and include taxonomic and functional genomic information. However, the NEON samples collected for metagenomic sequencing were also subjected to a wide range of measurements, including carbon and nitrogen isotopes, soil temperature, and pH. The following examples will help you annotate the functional and taxonomic information of NEON samples on the NMDC data portal so they can be analyzed along with all other NEON data from the soil and aquatic samples.

We will start by accessing some soil chemical and physical measurements using the neonUtilities package. Though we will be using R to do this, it is still useful to look up the information on the NEON Data Portal. Go back to the Explore Data Products page, reset all filters, then type in "Soil physical and chemical properties" in the Search bar. Scroll down and click on "Soil physical and chemical properties, periodic" (DP1.10086.001).

With the information from the Data Portal, we can download this data product using the neonUtilities package. We will start with a subset of terrestrial sites. Go ahead and enter the following command in a script file in RStudio or in a text editor.

library(neonUtilities)

soilTrialSites <- c("BONA","DEJU","HEAL","TOOL","BARR")

soilChem <- loadByProduct(
  dpID='DP1.10086.001',
  startdate = "2017-01",
  enddate = "2019-12",
  check.size = FALSE,
  site = soilTrialSites,
  package='expanded')

For full details on the loadByProduct() function, see the 'Use the neonUtilities Package' tutorial. Here we will just note some of the parameters. The dpID parameter is taken right from the Data Portal page for that data product. The startdate and enddate parameters define the time range, and for the site parameter we can enter a vector of the four-letter codes for each site. We set check.size to FALSE, which stops the function from reporting the download size and asking for confirmation before it downloads. If you do not specify a time range with startdate/enddate or define the sites to download using the site parameter, the download will be much bigger, and it is a good idea to leave this option at TRUE (the default). If you are incorporating this code into a script as part of a pipeline, where nobody is present to confirm the download, then you should set it to FALSE.

Viewing the data

Once you have run this code and downloaded the data, we can have a look. The object returned by the loadByProduct() function is a named list of data frames. To view or access each one, you can select it from the list using the $ operator. For example, to view the sls_soilChemistry table in RStudio, we can use the View() function:

View(soilChem$sls_soilChemistry)

You will then see the table open in a new tab in RStudio. It contains multiple soil chemistry measurements that could be useful for metagenomics analysis. These include carbon and nitrogen isotopes (organicd13C, d15N), percent C and N (organicCPercent, nitrogenPercent), and the ratio of carbon to nitrogen (CNratio).

Take some time on your own to explore the other tables with this download.
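For exploring the other tables, it can help to list what the download contains before opening anything with View(). The sketch below uses a small invented list (toyDownload, with made-up values) as a stand-in for the object returned by loadByProduct(), so it runs without a download; the same calls should work on soilChem.

```r
# Toy stand-in for the named list returned by loadByProduct().
# The sample IDs and values here are invented for illustration.
toyDownload <- list(
  sls_soilChemistry = data.frame(
    sampleID = c("SITE_001-O-1-1-20170101", "SITE_001-O-2-2-20170101"),
    nitrogenPercent = c(1.2, 0.9)
  ),
  sls_soilpH = data.frame(
    sampleID = c("SITE_001-O-1-1-20170101", "SITE_001-O-2-2-20170101"),
    soilInWaterpH = c(5.1, 5.4)
  )
)

# Names of all tables in the download
tableNames <- names(toyDownload)
print(tableNames)

# Number of rows and columns in each table
tableDims <- sapply(toyDownload, dim)
print(tableDims)

# First rows and column types of a single table
head(toyDownload$sls_soilChemistry)
str(toyDownload$sls_soilChemistry)
```

With the real download, names(soilChem) gives the table names you can use after the $ operator.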

Wrangling sample data for metagenomic samples

Now that you are familiar with downloading and viewing data through the neonUtilities package, we can start to focus on linking the data to the metagenomic samples. For soil samples, the DNA samples for metagenomics are combined from individual soil samples. This is why you will not see any of the chemistry data we just viewed corresponding to any metagenomic DNA sample directly. We need to access the list of samples that were combined to create a soil composite sample, so we can link these data directly. For this we look at the sls_metagenomicsPooling table.

View(soilChem$sls_metagenomicsPooling)

In this table, the list of subsamples for each composite DNA sample is in the genomicsPooledIDList column, and the ID of the composite sample is in the genomicsSampleID column. For example, to see the samples that were pooled into the composite sample 'TOOL_043-O-20170719-COMP', we can pull up the list for the 11th row in the table:

soilChem$sls_metagenomicsPooling$genomicsPooledIDList[11]

# you can check to confirm the 11th sample is TOOL_043-O-20170719-COMP

soilChem$sls_metagenomicsPooling[11,'genomicsSampleID']

The content of this field is a list demarcated by the pipe symbol ("|"):

"TOOL_043-O-5-7-20170719|TOOL_043-O-7.5-31-20170719|TOOL_043-O-32.5-27-20170719"

From the list you can see that there are three samples for the composite sample (there can be anywhere from 1-3 samples for each composite), all from plot TOOL_043 collected on 7/19/2017, and all from the organic soil horizon. Now we can observe the chemical measurements for those specific samples. We will use tidyverse to help get the fields from the table. Let's take the first one as an example:

library(tidyverse)

View(soilChem$sls_soilChemistry %>%
  filter(sampleID == "TOOL_043-O-5-7-20170719") %>%
  select(sampleID,d15N,organicd13C,CNratio))
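To look up every subsample of one composite at once rather than a single ID, the pipe-delimited string can be split into a vector and used with %in%. This is a minimal base R sketch; toyChem is an invented table (the d15N values and the TOOL_044 sample ID are made up), standing in for soilChem$sls_soilChemistry.

```r
# The pooled list for TOOL_043-O-20170719-COMP, as shown earlier in the tutorial
pooledList <- "TOOL_043-O-5-7-20170719|TOOL_043-O-7.5-31-20170719|TOOL_043-O-32.5-27-20170719"

# fixed = TRUE treats "|" as a literal character, not as regex alternation
subsampleIDs <- strsplit(pooledList, "|", fixed = TRUE)[[1]]
print(subsampleIDs)

# Toy chemistry table; values and the last sample ID are invented
toyChem <- data.frame(
  sampleID = c("TOOL_043-O-5-7-20170719", "TOOL_043-O-7.5-31-20170719",
               "TOOL_043-O-32.5-27-20170719", "TOOL_044-O-1-1-20170719"),
  d15N = c(1.0, 1.5, 2.0, 3.0)
)

# Keep only the rows belonging to the pooled subsamples
pooledChem <- toyChem[toyChem$sampleID %in% subsampleIDs, ]
print(pooledChem)
```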

We would like to be able to get these measurements for all the metagenomic subsamples. For this, we will first have to get the list for each composite sample and create a new table.

# split up the pooled list into new columns
genomicSamples <- soilChem$sls_metagenomicsPooling %>%
  tidyr::separate(genomicsPooledIDList, into=c("first","second","third"),sep="\\|",fill="right") %>%
  dplyr::select(genomicsSampleID,first,second,third)

# now view the table
View(genomicSamples)

Now we will adjust the table so that each sampleID is a row, with the genomicsSampleID listed for each sample:

genSampleExample <- genomicSamples %>%
  tidyr::pivot_longer(cols=c("first","second","third"),values_to = "sampleID") %>%
  dplyr::select(sampleID,genomicsSampleID) %>%
  drop_na()

# now view
View(genSampleExample)
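As an alternative sketch (not the approach used above), tidyr::separate_rows() can go from the pooled list to one row per subsample in a single step, without fixing the number of subsamples in advance. The toyPooling table and its IDs are invented for illustration; recent tidyr versions also offer separate_longer_delim() for the same job.

```r
library(dplyr)
library(tidyr)

# Toy pooling table; sample IDs are invented
toyPooling <- data.frame(
  genomicsSampleID = c("SITE_001-O-20170101-COMP", "SITE_002-M-20170102-COMP"),
  genomicsPooledIDList = c(
    "SITE_001-O-1-1-20170101|SITE_001-O-2-2-20170101|SITE_001-O-3-3-20170101",
    "SITE_002-M-4-4-20170102"
  )
)

# Split each pooled list on "|" and give every subsample its own row
toyLong <- toyPooling %>%
  tidyr::separate_rows(genomicsPooledIDList, sep = "\\|") %>%
  dplyr::rename(sampleID = genomicsPooledIDList) %>%
  dplyr::select(sampleID, genomicsSampleID)

print(toyLong)
```

Because no empty columns are created, this version needs no drop_na() step for composites with fewer than three subsamples.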

Now that you have all samples for each metagenomic sample listed, you can easily combine this table with other tables, using the sampleID. As an example we will go back to the soil chemistry data:

chemEx <- soilChem$sls_soilChemistry %>%
  dplyr::select(sampleID,d15N,organicd13C,nitrogenPercent,organicCPercent)

## now combine the tables
combinedTab <- left_join(genSampleExample,chemEx, by = "sampleID") %>% drop_na()

View(combinedTab)
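Because drop_na() removes any row with a missing value, it can be worth checking how much is lost in the join. The sketch below, on invented toy tables (toySubsamples and toyChem are hypothetical), counts subsamples with no chemistry record and missing values per column.

```r
library(dplyr)

# Toy tables; IDs and values are invented
toySubsamples <- data.frame(
  sampleID = c("A-1", "A-2", "B-1"),
  genomicsSampleID = c("A-COMP", "A-COMP", "B-COMP")
)
toyChem <- data.frame(
  sampleID = c("A-1", "A-2"),
  d15N = c(2.1, NA)
)

# Subsamples that have no row at all in the chemistry table
unmatched <- dplyr::anti_join(toySubsamples, toyChem, by = "sampleID")
print(unmatched)

# Missing values per column after the join, before any rows are dropped
joined <- dplyr::left_join(toySubsamples, toyChem, by = "sampleID")
naCounts <- colSums(is.na(joined))
print(naCounts)
```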

Merging metadata around composite samples

We now have a table that includes the genetic subsamples and their corresponding biogeochemical measurements. However, to compare the biogeochemical data directly with the metagenomic samples, you may want to merge the rows in the table back into a single row for each composite sample. This means averaging the chemical measurements across the subsamples of each composite sample. Care must be taken, however, when averaging certain measurements (see in particular the example for averaging pH, below). Here is a basic example for our table:

genome_groups <- combinedTab %>%
  group_by(genomicsSampleID) %>%
  summarize_at(c("d15N","organicd13C","nitrogenPercent","organicCPercent"), mean)

View(genome_groups)
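In current dplyr, summarize_at() is superseded by across(). The following is a sketch of the same per-composite averaging written with across(), on an invented toy table (toyCombined); it also records how many subsamples went into each mean, which is an addition for illustration rather than part of the tutorial's workflow.

```r
library(dplyr)

# Toy version of combinedTab; values are invented
toyCombined <- data.frame(
  genomicsSampleID = c("A-COMP", "A-COMP", "B-COMP"),
  d15N = c(2, 4, 1),
  nitrogenPercent = c(0.5, 1.5, 1.0)
)

toyGroups <- toyCombined %>%
  group_by(genomicsSampleID) %>%
  summarise(
    # average each measurement within a composite sample
    across(c(d15N, nitrogenPercent), ~ mean(.x, na.rm = TRUE)),
    # number of subsample rows behind each average
    nSubsamples = n(),
    .groups = "drop"
  )

print(toyGroups)
```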

You now have a table that you can use to analyze your metagenomic data with some of the biogeochemical data. Hopefully the few examples on this page can help you get started. Please let us know if there is a particular type of data you are interested in analyzing with the metagenomic data and we will add to this page (or similar).

Additional examples for accessing data

Here are some other examples to help you get started with NEON metagenomic data.

Merging pH measurements

The merging example above works for measurements where an arithmetic mean is meaningful. pH is different: because it is the negative base-10 logarithm of the hydrogen ion concentration, averaging pH values directly gives a misleading result. Fortunately, the R package respirometry includes the function mean_pH for averaging pH measurements. This function first converts each pH value to a hydrogen ion concentration [H+], averages the concentrations, then converts the mean back to the logarithmic scale.
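To see why this matters, here is a small base R sketch of the calculation described above (convert to [H+], average, convert back), using two invented pH values. It does not call respirometry; it only illustrates the arithmetic.

```r
# Two invented pH readings
pH <- c(4, 6)

# Naive arithmetic mean of the pH values
arithmeticMean <- mean(pH)

# Convert to hydrogen ion concentration, average, then convert back to pH
hydrogenIon <- 10^(-pH)
concentrationMean <- -log10(mean(hydrogenIon))

print(arithmeticMean)
print(concentrationMean)
```

The arithmetic mean is 5, while the concentration-based mean is about 4.3, because the more acidic sample contributes far more hydrogen ions.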

Below we show an example with the existing data. First we will create a new table with pH measurements, only keeping the samples from our metagenomic set:

soilpH_Example <- soilChem$sls_soilpH %>%
  dplyr::filter(sampleID %in% combinedTab$sampleID) %>%
  dplyr::select(sampleID,soilInWaterpH,soilInCaClpH)

# have a look at this table
View(soilpH_Example)

# now join with the existing table
combinedTab_pH <- left_join(combinedTab,soilpH_Example, by = "sampleID")

# and the final
View(combinedTab_pH)

Now, we can apply the same kind of tidyverse approach as the previous example, only using the mean_pH function:

library(respirometry)

genome_groups_pH <- combinedTab_pH %>%
  group_by(genomicsSampleID) %>%
  summarize_at(c("soilInWaterpH","soilInCaClpH"), mean_pH)

View(genome_groups_pH)

Note that the previous command dropped all the other chemical data, because only the pH columns were summarized. In the following example we use two summarize_at commands to apply different functions to the two types of variables, and then combine the results with left_join:

genome_groups_all <- combinedTab_pH %>%
  group_by(genomicsSampleID) %>%
  {left_join(
    summarize_at(.,vars("d15N","organicd13C","nitrogenPercent","organicCPercent"), mean),
    summarize_at(.,vars("soilInWaterpH","soilInCaClpH"), mean_pH),
    by = "genomicsSampleID"
  )}

View(genome_groups_all)
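An alternative sketch, assuming a dplyr version that provides across(), applies both functions in a single summarise() call so that no join is needed. To keep the example self-contained, it uses an invented toy table and a hypothetical helper, meanpH_sketch, that follows the description of mean_pH given earlier; with the real data you would pass respirometry's mean_pH instead.

```r
library(dplyr)

# Hypothetical helper for this sketch: average pH via hydrogen ion concentration
meanpH_sketch <- function(pH) -log10(mean(10^(-pH), na.rm = TRUE))

# Toy version of combinedTab_pH; values are invented
toyTab <- data.frame(
  genomicsSampleID = c("A-COMP", "A-COMP", "B-COMP"),
  d15N = c(2, 4, 1),
  soilInWaterpH = c(4, 6, 5)
)

toyAll <- toyTab %>%
  group_by(genomicsSampleID) %>%
  summarise(
    # ordinary mean for the chemistry column
    across(d15N, ~ mean(.x, na.rm = TRUE)),
    # concentration-based mean for the pH column
    across(soilInWaterpH, meanpH_sketch),
    .groups = "drop"
  )

print(toyAll)
```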

Downloading raw NEON metagenome data

If you are interested in accessing raw NEON metagenomic data (not processed as they are in NMDC) in order to run your own analyses, you can use the same neonUtilities package. The following code shows an example of how to download the raw sequence files.

library(neonUtilities)
library(dplyr)

#specify sites and/or date ranges depending on your research questions
metaGdata <- loadByProduct(dpID = 'DP1.10107.001',
                          startdate = "2018-01",
                          enddate = "2019-12",
                          check.size = FALSE,
                          package = 'expanded')

The following code produces a list of all the raw data file URLs in the data product loaded above.

rawFileInfo <- metaGdata$mms_rawDataFiles

urls_fordownload <- unique(rawFileInfo$rawDataFilePath)

If you want to access only a subset of those data, there are several options. One option is to use a list of known sample IDs. In the following example we subset to the list of samples from the earlier example in this tutorial. Note that the dnaSampleID in the rawFileInfo table is different from the genomicsSampleID in the combinedTab table. The first step, then, is to create a genomicsSampleID field in the raw files table so we can select from it.

# Option a, create a list of samples from earlier example
targetsamples <- combinedTab$genomicsSampleID

# Create a genomicsSampleID from raw files table by stripping the -DNA1/-DNA2/-DNA3 suffix,
# and then subset this to samples
rawfiles_ids <- rawFileInfo %>%
  mutate(genomicsSampleID = gsub("-DNA[1-3]","",dnaSampleID)) %>%
  subset(genomicsSampleID %in% targetsamples) %>%
  unique()

urls_fordownload <- unique(rawfiles_ids$rawDataFilePath)
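Before relying on the subset, it may be worth confirming that the suffix pattern matched every ID. The sketch below uses illustrative IDs (the first is the tutorial's composite ID with an assumed -DNA1 suffix; the second is invented) and a slightly stricter pattern anchored to the end of the string. Note that a regex character class lists characters without separators, so a comma placed inside the brackets is itself treated as a matchable character.

```r
# Illustrative DNA sample IDs; the suffix format is assumed for this sketch
dnaIDs <- c("TOOL_043-O-20170719-COMP-DNA1", "BONA_001-M-20180702-COMP-DNA2")

# Remove a trailing "-DNA" followed by digits
stripped <- sub("-DNA[0-9]+$", "", dnaIDs)
print(stripped)

# Any ID left unchanged did not match the pattern and deserves a closer look
unchanged <- stripped[stripped == dnaIDs]
print(length(unchanged))

# A comma inside a character class is matched literally
commaMatches <- grepl("^[1,2,3]$", ",")
print(commaMatches)
```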

Another option is to download the raw data in chunks:

# Option b
urls_fordownload <- unique(rawFileInfo$rawDataFilePath)[1:20] #for instance, the first 20 URLs, and so on

Now write the URLs to a text file:

#put in whatever working directory you want
file_conn = file("./NEONmetagenome_downloadIDs.txt")
writeLines(unlist(urls_fordownload), file_conn, sep="\n")
close(file_conn)
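A quick way to confirm the file was written as expected is to read it back and compare. This sketch uses placeholder URLs and a temporary file rather than the real list; when writeLines() is given a file path instead of a connection, it opens and closes the file itself.

```r
# Placeholder URLs for illustration only
urls <- c("https://example.org/sampleA_R1.fastq.gz",
          "https://example.org/sampleA_R2.fastq.gz")

# Write one URL per line to a temporary file
outFile <- tempfile(fileext = ".txt")
writeLines(urls, outFile)

# Read the file back and check it matches what was written
readBack <- readLines(outFile)
roundTripOK <- identical(readBack, urls)
print(roundTripOK)
```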

Now open a command line or terminal (you can use the Terminal tab in RStudio), navigate to the directory containing your text file, and run the following:

wget -i NEONmetagenome_downloadIDs.txt

The gzipped fastq files will download to whichever directory you are in. This will work both locally and on an HPC.