Data Notification
Coming updates to NEON microbial data
January 29, 2024
Over the last two years, NEON's microbial data products have changed substantially, and data released in the coming year will include many updates and new, improved DNA sequence data. Improvements in sequencing technology and laboratory protocols are producing higher-quality data in much greater volume. See below for details of the current status of individual data products; another update will be provided in May.
The metagenomics data products will see the biggest improvements, across all sample types: soil (DP1.10107.001), benthic (DP1.20279.001), and surface water (DP1.20281.001). For the 2022 collections, DNA sequencing will move to the much higher-capacity Illumina NovaSeq platform. As a result, the average number of sequence reads per sample will increase about ten-fold. For the 2023 and 2024 collections, NEON will embark on a collaboration with the Joint Genome Institute (JGI) and the National Microbiome Data Collaborative (NMDC), through which all metagenomic samples will be sequenced by JGI and then analyzed through the NMDC data analysis pipeline. This is supported in part by a Community Science Proposal grant awarded to NEON this year. The JGI sequencing will result in an approximately 50-fold increase in sequence output over current NEON averages. Another exciting aspect of this collaboration is that all NEON metagenomic samples, past and present, will be incorporated into the NMDC database and run through the NMDC data analysis pipeline.
Sequencing of the 2022 samples will begin in February 2024. These sequencing runs will also include the last samples remaining from the 2020 and 2021 field collections. The first sets of data will begin to be released in March 2024, and all metagenome sequences from 2020 – 2022 should be available as provisional data by the end of April 2024.
Sequencing of the 2023 metagenome samples will also begin in February. Because of the intensive data analysis that will accompany the sequencing, there will be a longer lag between sequencing and release. The first set of samples should be released in May, with the two subsequent batches released at monthly intervals thereafter. Discussions are underway between NEON and NMDC as to how these data will be presented. NEON will continue to provide links to the raw data, as well as links to the metagenome annotations on NMDC.
The marker gene data products (soil: DP1.10108.001, benthic: DP1.20280.001, surface water: DP1.20282.001) have also gone through major changes. Over the past year and a half, the Rush University Genomics and Microbiome Core Facility (GMCF), a laboratory that specializes in environmental DNA, optimized the PCR and sequencing protocols for both ITS and 16S amplicon sequencing. These changes have greatly improved the quality of the sequencing results, especially for the fungal ITS products. The GMCF has also upgraded to the Illumina NovaSeq to sequence the marker gene products, substantially increasing the number of sequences per sample.
Last year the Rush GMCF laboratory completed the fungal ITS sequencing of the 2019 samples. This wrapped up the 2019 marker gene sequencing (the bacterial 16S had already been done). All the 2019 marker gene data are part of the 2024 official release. The GMCF is currently sequencing the 2020 and 2021 samples, and the resulting data will be released as provisional data as they become available. The first sets of samples will be available online in February, with all samples from these two years expected to be completed by April. After completing the 2020 and 2021 samples, the GMCF will begin to sequence the 2022 and 2023 samples, with these data expected to be available by October.
Lastly, the microbe community composition data products (soil: DP1.10081.001, benthic: DP1.20086.001, and surface water: DP1.20141.001) are undergoing major modifications, primarily to the data analysis pipeline. These changes are intended to 1) make it easier to compare all samples across years and sites, 2) improve accessibility to the data by, for example, making it easier to import the sample data into popular metabarcoding programs such as phyloseq and QIIME 2 for downstream ecological analysis, and 3) provide a modular pipeline system that is flexible and can be adapted to new programs and analytic methods.
The revised community composition products will begin to be available on the data portal in the spring of 2024. The 2019 samples are targeted for release in April. As the 2020/2021 marker gene sequences become available, they will be run through the community composition pipeline, with the results published as provisional data by the end of May. Likewise, for the 2022/2023 samples, the community composition analysis will be run as those sequences become available, with results expected by October.