Tutorial
Work With NEON's Plant Phenology Data
Authors: Megan A. Jones, Natalie Robinson, Lee Stanish
Last Updated: May 13, 2021
Many organisms, including plants, show patterns of change across seasons - the different stages of this observable change are called phenophases. In this tutorial we explore how to work with NEON plant phenophase data.
Objectives
After completing this activity, you will be able to:
- work with NEON Plant Phenology Observation data.
- use dplyr functions to filter data.
- plot time series data in a bar plot using ggplot the function.
Things You’ll Need To Complete This Tutorial
You will need the most current version of R and, preferably, RStudio
loaded
on your computer to complete this tutorial.
Install R Packages
-
neonUtilities:
install.packages("neonUtilities")
-
ggplot2:
install.packages("ggplot2")
-
dplyr:
install.packages("dplyr")
More on Packages in R – Adapted from Software Carpentry.
Download Data
This tutorial is designed to have you download data directly from the NEON
portal API using the neonUtilities package. However, you can also directly
download this data, prepackaged, from FigShare. This data set includes all the
files needed for the Work with NEON OS & IS Data - Plant Phenology & Temperature
tutorial series. The data are in the format you would receive if downloading them
using the zipsByProduct()
function in the neonUtilities package.
Additional Resources
- NEON data portal
- NEON Plant Phenology Observations data product user guide
- RStudio's data wrangling (dplyr/tidyr) cheatsheet
- NEONScience GitHub Organization
- nneo API wrapper on CRAN
Plants change throughout the year - these are phenophases. Why do they change?
Explore Phenology Data
The following sections provide a brief overview of the NEON plant phenology observation data. When designing a research project using this data, you need to consult the documents associated with this data product and not rely solely on this summary.
The following description of the NEON Plant Phenology Observation data is modified from the data product user guide.
NEON Plant Phenology Observation Data
NEON collects plant phenology data and provides it as NEON data product DP1.10055.001.
The plant phenology observations data product provides in-situ observations of the phenological status and intensity of tagged plants (or patches) during discrete observations events.
Sampling occurs at all terrestrial field sites at site and season specific intervals. During Phase I (dominant species) sampling (pre-2021), three species with 30 individuals each are sampled. In 2021, Phase II (community) sampling will begin, with <=20 species with 5 or more individuals sampled will occur.
Status-based Monitoring
NEON employs status-based monitoring, in which the phenological condition of an individual is reported any time that individual is observed. At every observations bout, records are generated for every phenophase that is occurring and for every phenophase not occurring. With this approach, events (such as leaf emergence in Mediterranean climates, or flowering in many desert species) that may occur multiple times during a single year, can be captured. Continuous reporting of phenophase status enables quantification of the duration of phenophases rather than just their date of onset while allows enabling the explicit quantification of uncertainty in phenophase transition dates that are introduced by monitoring in discrete temporal bouts.
Specific products derived from this sampling include the observed phenophase status (whether or not a phenophase is occurring) and the intensity of phenophases for individuals in which phenophase status = ‘yes’. Phenophases reported are derived from the USA National Phenology Network (USA-NPN) categories. The number of phenophases observed varies by growth form and ranges from 1 phenophase (cactus) to 7 phenophases (semi-evergreen broadleaf). In this tutorial we will focus only on the state of the phenophase, not the phenophase intensity data.
Phenology Transects
Plant phenology observations occurs at all terrestrial NEON sites along an 800 meter square loop transect (primary) and within a 200 m x 200 m plot located within view of a canopy level, tower-mounted, phenology camera.
Timing of Observations
At each site, there are:
- ~50 observation bouts per year.
- no more that 100 sampling points per phenology transect.
- no more than 9 sampling points per phenocam plot.
- 1 annual measurement per year to collect annual size and disease status measurements from each sampling point.
Available Data Tables
In the downloaded data packet, data are available in two main files
- phe_statusintensity: Plant phenophase status and intensity data
- phe_perindividual: Geolocation and taxonomic identification for phenology plants
- phe_perindividualperyear: recorded once a year, essentially the "metadata" about the plant: DBH, height, etc.
There are other files in each download including a readme with information on the data product and the download; a variables file that defines the term descriptions, data types, and units; a validation file with data entry validation and parsing rules; and an XML with machine readable metadata.
Stack NEON Data
NEON data are delivered in a site and year-month format. When you download data, you will get a single zipped file containing a directory for each month and site that you've requested data for. Dealing with these separate tables from even one or two sites over a 12 month period can be a bit overwhelming. Luckily NEON provides an R package neonUtilities that takes the unzipped downloaded file and joining the data files. The teaching data downloaded with this tutorial is already stacked. If you are working with other NEON data, please go through the tutorial to stack the data in R or in Python and then return to this tutorial.
Work with NEON Data
When we do this for phenology data we get three files, one for each data table, with all the data from your site and date range of interest.
First, we need to set up our R environment.
# install needed package (only uncomment & run if not already installed)
#install.packages("neonUtilities")
#install.packages("dplyr")
#install.packages("ggplot2")
# load needed packages
library(neonUtilities)
library(dplyr)
library(ggplot2)
options(stringsAsFactors=F) #keep strings as character type not factors
# set working directory to ensure R can find the file we wish to import and where
# we want to save our files. Be sure to move the download into your working directory!
wd <- "~/Git/data/" # Change this to match your local environment
setwd(wd)
Let's start by loading our data of interest. For this series, we'll work with date from the NEON Domain 02 sites:
- Blandy Farm (BLAN)
- Smithsonian Conservation Biology Institute (SCBI)
- Smithsonian Environmental Research Center (SERC)
And we'll use data from January 2017 to December 2019. This downloads over 9MB of data. If this is too large, use a smaller date range. If you opt to do this, your figures and some output may look different later in the tutorial.
With this information, we can download our data using the neonUtilities package.
If you are not using a NEON token to download your data, remove the
token = Sys.getenv(NEON_TOKEN)
line of code (learn more about NEON API tokens
in the
Using an API Token when Accessing NEON Data with neonUtilities tutorial).
If you are using the data downloaded at the start of the tutorial, use the commented out code in the second half of this code chunk.
## Two options for accessing data - programmatic or from the example dataset
# Read data from data portal
phe <- loadByProduct(dpID = "DP1.10055.001", site=c("BLAN","SCBI","SERC"),
startdate = "2017-01", enddate="2019-12",
token = Sys.getenv("NEON_TOKEN"),
check.size = F)
## API token was not recognized. Public rate limit applied.
## Finding available files
##
|
| | 0%
|
|= | 1%
|
|= | 2%
|
|== | 3%
|
|=== | 4%
|
|=== | 5%
|
|==== | 6%
|
|==== | 7%
|
|===== | 8%
|
|====== | 9%
|
|====== | 11%
|
|======= | 12%
|
|======== | 13%
|
|======== | 14%
|
|========= | 15%
|
|========== | 16%
|
|========== | 17%
|
|=========== | 18%
|
|============ | 19%
|
|============ | 20%
|
|============= | 21%
|
|============= | 22%
|
|============== | 23%
|
|=============== | 24%
|
|=============== | 25%
|
|================ | 26%
|
|================= | 27%
|
|================= | 28%
|
|================== | 29%
|
|=================== | 31%
|
|=================== | 32%
|
|==================== | 33%
|
|===================== | 34%
|
|===================== | 35%
|
|====================== | 36%
|
|====================== | 37%
|
|======================= | 38%
|
|======================== | 39%
|
|======================== | 40%
|
|========================= | 41%
|
|========================== | 42%
|
|========================== | 43%
|
|=========================== | 44%
|
|============================ | 45%
|
|============================ | 46%
|
|============================= | 47%
|
|============================== | 48%
|
|============================== | 49%
|
|=============================== | 51%
|
|=============================== | 52%
|
|================================ | 53%
|
|================================= | 54%
|
|================================= | 55%
|
|================================== | 56%
|
|=================================== | 57%
|
|=================================== | 58%
|
|==================================== | 59%
|
|===================================== | 60%
|
|===================================== | 61%
|
|====================================== | 62%
|
|======================================= | 63%
|
|======================================= | 64%
|
|======================================== | 65%
|
|======================================== | 66%
|
|========================================= | 67%
|
|========================================== | 68%
|
|========================================== | 69%
|
|=========================================== | 71%
|
|============================================ | 72%
|
|============================================ | 73%
|
|============================================= | 74%
|
|============================================== | 75%
|
|============================================== | 76%
|
|=============================================== | 77%
|
|================================================ | 78%
|
|================================================ | 79%
|
|================================================= | 80%
|
|================================================= | 81%
|
|================================================== | 82%
|
|=================================================== | 83%
|
|=================================================== | 84%
|
|==================================================== | 85%
|
|===================================================== | 86%
|
|===================================================== | 87%
|
|====================================================== | 88%
|
|======================================================= | 89%
|
|======================================================= | 91%
|
|======================================================== | 92%
|
|========================================================= | 93%
|
|========================================================= | 94%
|
|========================================================== | 95%
|
|========================================================== | 96%
|
|=========================================================== | 97%
|
|============================================================ | 98%
|
|============================================================ | 99%
|
|=============================================================| 100%
##
## Downloading files totaling approximately 7.985319 MB
## Downloading 95 files
##
|
| | 0%
|
|= | 1%
|
|= | 2%
|
|== | 3%
|
|=== | 4%
|
|=== | 5%
|
|==== | 6%
|
|===== | 7%
|
|===== | 9%
|
|====== | 10%
|
|====== | 11%
|
|======= | 12%
|
|======== | 13%
|
|======== | 14%
|
|========= | 15%
|
|========== | 16%
|
|========== | 17%
|
|=========== | 18%
|
|============ | 19%
|
|============ | 20%
|
|============= | 21%
|
|============== | 22%
|
|============== | 23%
|
|=============== | 24%
|
|================ | 26%
|
|================ | 27%
|
|================= | 28%
|
|================== | 29%
|
|================== | 30%
|
|=================== | 31%
|
|=================== | 32%
|
|==================== | 33%
|
|===================== | 34%
|
|===================== | 35%
|
|====================== | 36%
|
|======================= | 37%
|
|======================= | 38%
|
|======================== | 39%
|
|========================= | 40%
|
|========================= | 41%
|
|========================== | 43%
|
|=========================== | 44%
|
|=========================== | 45%
|
|============================ | 46%
|
|============================= | 47%
|
|============================= | 48%
|
|============================== | 49%
|
|============================== | 50%
|
|=============================== | 51%
|
|================================ | 52%
|
|================================ | 53%
|
|================================= | 54%
|
|================================== | 55%
|
|================================== | 56%
|
|=================================== | 57%
|
|==================================== | 59%
|
|==================================== | 60%
|
|===================================== | 61%
|
|====================================== | 62%
|
|====================================== | 63%
|
|======================================= | 64%
|
|======================================== | 65%
|
|======================================== | 66%
|
|========================================= | 67%
|
|========================================== | 68%
|
|========================================== | 69%
|
|=========================================== | 70%
|
|=========================================== | 71%
|
|============================================ | 72%
|
|============================================= | 73%
|
|============================================= | 74%
|
|============================================== | 76%
|
|=============================================== | 77%
|
|=============================================== | 78%
|
|================================================ | 79%
|
|================================================= | 80%
|
|================================================= | 81%
|
|================================================== | 82%
|
|=================================================== | 83%
|
|=================================================== | 84%
|
|==================================================== | 85%
|
|===================================================== | 86%
|
|===================================================== | 87%
|
|====================================================== | 88%
|
|======================================================= | 89%
|
|======================================================= | 90%
|
|======================================================== | 91%
|
|======================================================== | 93%
|
|========================================================= | 94%
|
|========================================================== | 95%
|
|========================================================== | 96%
|
|=========================================================== | 97%
|
|============================================================ | 98%
|
|============================================================ | 99%
|
|=============================================================| 100%
##
## Unpacking zip files using 1 cores.
## Stacking operation across a single core.
## Stacking table phe_perindividual
## Stacking table phe_statusintensity
## Stacking table phe_perindividualperyear
## Copied the most recent publication of validation file to /stackedFiles
## Copied the most recent publication of categoricalCodes file to /stackedFiles
## Copied the most recent publication of variable definition file to /stackedFiles
## Finished: Stacked 3 data tables and 3 metadata tables!
## Stacking took 1.46806 secs
# if you aren't sure you can handle the data file size use check.size = T.
# save dataframes from the downloaded list
ind <- phe$phe_perindividual #individual information
status <- phe$phe_statusintensity #status & intensity info
##If choosing to use example dataset downloaded from this tutorial:
# Stack multiple files within the downloaded phenology data
#stackByTable("NEON-pheno-temp-timeseries_v2/filesToStack10055", folder = T)
# read in data - readTableNEON uses the variables file to assign the correct
# data type for each variable
#ind <- readTableNEON('NEON-pheno-temp-timeseries_v2/filesToStack10055/stackedFiles/phe_perindividual.csv', 'NEON-pheno-temp-timeseries_v2/filesToStack10055/stackedFiles/variables_10055.csv')
#status <- readTableNEON('NEON-pheno-temp-timeseries_v2/filesToStack10055/stackedFiles/phe_statusintensity.csv', 'NEON-pheno-temp-timeseries_v2/filesToStack10055/stackedFiles/variables_10055.csv')
Let's explore the data. Let's get to know what the ind
dataframe looks like.
# What are the fieldnames in this dataset?
names(ind)
## [1] "uid" "namedLocation"
## [3] "domainID" "siteID"
## [5] "plotID" "decimalLatitude"
## [7] "decimalLongitude" "geodeticDatum"
## [9] "coordinateUncertainty" "elevation"
## [11] "elevationUncertainty" "subtypeSpecification"
## [13] "transectMeter" "directionFromTransect"
## [15] "ninetyDegreeDistance" "sampleLatitude"
## [17] "sampleLongitude" "sampleGeodeticDatum"
## [19] "sampleCoordinateUncertainty" "sampleElevation"
## [21] "sampleElevationUncertainty" "date"
## [23] "editedDate" "individualID"
## [25] "taxonID" "scientificName"
## [27] "identificationQualifier" "taxonRank"
## [29] "nativeStatusCode" "growthForm"
## [31] "vstTag" "samplingProtocolVersion"
## [33] "measuredBy" "identifiedBy"
## [35] "recordedBy" "remarks"
## [37] "dataQF" "publicationDate"
## [39] "release"
# Unsure of what some of the variables are you? Look at the variables table!
View(phe$variables_10055)
# if using the pre-downloaded data, you need to read in the variables file
# or open and look at it on your desktop
#var <- read.csv('NEON-pheno-temp-timeseries_v2/filesToStack10055/stackedFiles/variables_10055.csv')
#View(var)
# how many rows are in the data?
nrow(ind)
## [1] 433
# look at the first six rows of data.
#head(ind) #this is a good function to use but looks messy so not rendering it
# look at the structure of the dataframe.
str(ind)
## 'data.frame': 433 obs. of 39 variables:
## $ uid : chr "76bf37d9-c834-43fc-a430-83d87e4b9289" "cf0239bb-2953-44a8-8fd2-051539be5727" "833e5f41-d5cb-4550-ba60-e6f000a2b1b6" "6c2e348d-d19e-4543-9d22-0527819ee964" ...
## $ namedLocation : chr "BLAN_061.phenology.phe" "BLAN_061.phenology.phe" "BLAN_061.phenology.phe" "BLAN_061.phenology.phe" ...
## $ domainID : chr "D02" "D02" "D02" "D02" ...
## $ siteID : chr "BLAN" "BLAN" "BLAN" "BLAN" ...
## $ plotID : chr "BLAN_061" "BLAN_061" "BLAN_061" "BLAN_061" ...
## $ decimalLatitude : num 39.1 39.1 39.1 39.1 39.1 ...
## $ decimalLongitude : num -78.1 -78.1 -78.1 -78.1 -78.1 ...
## $ geodeticDatum : chr NA NA NA NA ...
## $ coordinateUncertainty : num NA NA NA NA NA NA NA NA NA NA ...
## $ elevation : num 183 183 183 183 183 183 183 183 183 183 ...
## $ elevationUncertainty : num NA NA NA NA NA NA NA NA NA NA ...
## $ subtypeSpecification : chr "primary" "primary" "primary" "primary" ...
## $ transectMeter : num 491 464 537 15 753 506 527 305 627 501 ...
## $ directionFromTransect : chr "Left" "Right" "Left" "Left" ...
## $ ninetyDegreeDistance : num 0.5 4 2 3 2 1 2 3 2 3 ...
## $ sampleLatitude : num NA NA NA NA NA NA NA NA NA NA ...
## $ sampleLongitude : num NA NA NA NA NA NA NA NA NA NA ...
## $ sampleGeodeticDatum : chr "WGS84" "WGS84" "WGS84" "WGS84" ...
## $ sampleCoordinateUncertainty: num NA NA NA NA NA NA NA NA NA NA ...
## $ sampleElevation : num NA NA NA NA NA NA NA NA NA NA ...
## $ sampleElevationUncertainty : num NA NA NA NA NA NA NA NA NA NA ...
## $ date : POSIXct, format: "2016-04-20" ...
## $ editedDate : POSIXct, format: "2016-05-09" ...
## $ individualID : chr "NEON.PLA.D02.BLAN.06290" "NEON.PLA.D02.BLAN.06501" "NEON.PLA.D02.BLAN.06204" "NEON.PLA.D02.BLAN.06223" ...
## $ taxonID : chr "RHDA" "SOAL6" "RHDA" "LOMA6" ...
## $ scientificName : chr "Rhamnus davurica Pall." "Solidago altissima L." "Rhamnus davurica Pall." "Lonicera maackii (Rupr.) Herder" ...
## $ identificationQualifier : chr NA NA NA NA ...
## $ taxonRank : chr "species" "species" "species" "species" ...
## $ nativeStatusCode : chr "I" "N" "I" "I" ...
## $ growthForm : chr "Deciduous broadleaf" "Forb" "Deciduous broadleaf" "Deciduous broadleaf" ...
## $ vstTag : chr NA NA NA NA ...
## $ samplingProtocolVersion : chr NA "NEON.DOC.014040vJ" "NEON.DOC.014040vJ" "NEON.DOC.014040vJ" ...
## $ measuredBy : chr "jcoloso@neoninc.org" "jward@battelleecology.org" "alandes@field-ops.org" "alandes@field-ops.org" ...
## $ identifiedBy : chr "shackley@neoninc.org" "llemmon@field-ops.org" "llemmon@field-ops.org" "llemmon@field-ops.org" ...
## $ recordedBy : chr "shackley@neoninc.org" NA NA NA ...
## $ remarks : chr "Nearly dead shaded out" "no entry" "no entry" "no entry" ...
## $ dataQF : chr NA NA NA NA ...
## $ publicationDate : chr "20201218T103411Z" "20201218T103411Z" "20201218T103411Z" "20201218T103411Z" ...
## $ release : chr "RELEASE-2021" "RELEASE-2021" "RELEASE-2021" "RELEASE-2021" ...
Notice that the neonUtilities package read the data type from the variables file and then automatically converts the data to the correct date type in R.
(Note that if you first opened your data file in Excel, you might see 06/14/2014 as the format instead of 2014-06-14. Excel can do some ~~weird~~ interesting things to dates.)
Phenology status
Now let's look at the status data.
# What variables are included in this dataset?
names(status)
## [1] "uid" "namedLocation"
## [3] "domainID" "siteID"
## [5] "plotID" "date"
## [7] "editedDate" "dayOfYear"
## [9] "individualID" "phenophaseName"
## [11] "phenophaseStatus" "phenophaseIntensityDefinition"
## [13] "phenophaseIntensity" "samplingProtocolVersion"
## [15] "measuredBy" "recordedBy"
## [17] "remarks" "dataQF"
## [19] "publicationDate" "release"
nrow(status)
## [1] 219357
#head(status) #this is a good function to use but looks messy so not rendering it
str(status)
## 'data.frame': 219357 obs. of 20 variables:
## $ uid : chr "b69ada55-41d1-41c7-9031-149c54de51f9" "9be6f7ad-4422-40ac-ba7f-e32e0184782d" "58e7aeaf-163c-4ea2-ad75-db79a580f2f8" "efe7ca02-d09e-4964-b35d-aebdac8f3efb" ...
## $ namedLocation : chr "BLAN_061.phenology.phe" "BLAN_061.phenology.phe" "BLAN_061.phenology.phe" "BLAN_061.phenology.phe" ...
## $ domainID : chr "D02" "D02" "D02" "D02" ...
## $ siteID : chr "BLAN" "BLAN" "BLAN" "BLAN" ...
## $ plotID : chr "BLAN_061" "BLAN_061" "BLAN_061" "BLAN_061" ...
## $ date : POSIXct, format: "2017-02-24" ...
## $ editedDate : POSIXct, format: "2017-03-31" ...
## $ dayOfYear : num 55 55 55 55 55 55 55 55 55 55 ...
## $ individualID : chr "NEON.PLA.D02.BLAN.06229" "NEON.PLA.D02.BLAN.06226" "NEON.PLA.D02.BLAN.06222" "NEON.PLA.D02.BLAN.06223" ...
## $ phenophaseName : chr "Leaves" "Leaves" "Leaves" "Leaves" ...
## $ phenophaseStatus : chr "no" "no" "no" "no" ...
## $ phenophaseIntensityDefinition: chr NA NA NA NA ...
## $ phenophaseIntensity : chr NA NA NA NA ...
## $ samplingProtocolVersion : chr NA NA NA NA ...
## $ measuredBy : chr "llemmon@neoninc.org" "llemmon@neoninc.org" "llemmon@neoninc.org" "llemmon@neoninc.org" ...
## $ recordedBy : chr "llemmon@neoninc.org" "llemmon@neoninc.org" "llemmon@neoninc.org" "llemmon@neoninc.org" ...
## $ remarks : chr NA NA NA NA ...
## $ dataQF : chr "legacyData" "legacyData" "legacyData" "legacyData" ...
## $ publicationDate : chr "20201217T203824Z" "20201217T203824Z" "20201217T203824Z" "20201217T203824Z" ...
## $ release : chr "RELEASE-2021" "RELEASE-2021" "RELEASE-2021" "RELEASE-2021" ...
# date range
min(status$date)
## [1] "2017-02-24 GMT"
max(status$date)
## [1] "2019-12-12 GMT"
Clean up the Data
- remove duplicates (full rows)
- convert to date format
- retain only the most recent
editedDate
in the perIndividual and status table.
Remove Duplicates
The individual table (ind) file is included in each site by month-year file. As a result when all the tables are stacked there are many duplicates.
Let's remove any duplicates that exist.
# drop UID as that will be unique for duplicate records
ind_noUID <- select(ind, -(uid))
status_noUID <- select(status, -(uid))
# remove duplicates
## expect many
ind_noD <- distinct(ind_noUID)
nrow(ind_noD)
## [1] 433
status_noD<-distinct(status_noUID)
nrow(status_noD)
## [1] 216837
Variable Overlap between Tables
From the initial inspection of the data we can see there is overlap in variable names between the fields.
Let's see what they are.
# where is there an intersection of names
intersect(names(status_noD), names(ind_noD))
## [1] "namedLocation" "domainID"
## [3] "siteID" "plotID"
## [5] "date" "editedDate"
## [7] "individualID" "samplingProtocolVersion"
## [9] "measuredBy" "recordedBy"
## [11] "remarks" "dataQF"
## [13] "publicationDate" "release"
There are several fields that overlap between the datasets. Some of these are expected to be the same and will be what we join on.
However, some of these will have different values in each table. We want to keep those distinct value and not join on them. Therefore, we can rename these fields before joining:
- date
- editedDate
- measuredBy
- recordedBy
- samplingProtocolVersion
- remarks
- dataQF
- publicationDate
Now we want to rename the variables that would have duplicate names. We can rename all the variables in the status object to have "Stat" at the end of the variable name.
# in Status table rename like columns
status_noD <- rename(status_noD, dateStat=date,
editedDateStat=editedDate, measuredByStat=measuredBy,
recordedByStat=recordedBy,
samplingProtocolVersionStat=samplingProtocolVersion,
remarksStat=remarks, dataQFStat=dataQF,
publicationDateStat=publicationDate)
Filter to last editedDate
The individual (ind) table contains all instances that any of the location or
taxonomy data of an individual was updated. Therefore there are many rows for
some individuals. We only want the latest editedDate
on ind.
# retain only the max of the date for each individualID
ind_last <- ind_noD %>%
group_by(individualID) %>%
filter(editedDate==max(editedDate))
# oh wait, duplicate dates, retain only the most recent editedDate
ind_lastnoD <- ind_last %>%
group_by(editedDate, individualID) %>%
filter(row_number()==1)
Join Dataframes
Now we can join the two data frames on all the variables with the same name.
We use a left_join()
from the dpylr package because we want to match all the
rows from the "left" (first) dataframe to any rows that also occur in the "right"
(second) dataframe.
Check out RStudio's data wrangling (dplyr/tidyr) cheatsheet for other types of joins.
# Create a new dataframe "phe_ind" with all the data from status and some from ind_lastnoD
phe_ind <- left_join(status_noD, ind_lastnoD)
## Joining, by = c("namedLocation", "domainID", "siteID", "plotID", "individualID", "release")
Now that we have clean datasets we can begin looking into our particular data to address our research question: do plants show patterns of changes in phenophase across season?
Patterns in Phenophase
From our larger dataset (several sites, species, phenophases), let's create a
dataframe with only the data from a single site, species, and phenophase and
call it phe_1sp
.
Select Site(s) of Interest
To do this, we'll first select our site of interest. Note how we set this up with an object that is our site of interest. This will allow us to more easily change which site or sites if we want to adapt our code later.
# set site of interest
siteOfInterest <- "SCBI"
# use filter to select only the site of Interest
## using %in% allows one to add a vector if you want more than one site.
## could also do it with == instead of %in% but won't work with vectors
phe_1st <- filter(phe_ind, siteID %in% siteOfInterest)
Select Species of Interest
Now we may only want to view a single species or a set of species. Let's first
look at the species that are present in our data. We could do this just by looking
at the taxonID
field which give the four letter UDSA plant code for each
species. But if we don't know all the plant codes, we can get a bit fancier and
view both
# see which species are present - taxon ID only
unique(phe_1st$taxonID)
## [1] "JUNI" "MIVI" "LITU"
# or see which species are present with taxon ID + species name
unique(paste(phe_1st$taxonID, phe_1st$scientificName, sep=' - '))
## [1] "JUNI - Juglans nigra L."
## [2] "MIVI - Microstegium vimineum (Trin.) A. Camus"
## [3] "LITU - Liriodendron tulipifera L."
For now, let's choose only the flowering tree Liriodendron tulipifera (LITU).
By writing it this way, we could also add a list of species to the speciesOfInterest
object to select for multiple species.
speciesOfInterest <- "LITU"
#subset to just "LITU"
# here just use == but could also use %in%
phe_1sp <- filter(phe_1st, taxonID==speciesOfInterest)
# check that it worked
unique(phe_1sp$taxonID)
## [1] "LITU"
Select Phenophase of Interest
And, perhaps a single phenophase.
# see which phenophases are present
unique(phe_1sp$phenophaseName)
## [1] "Open flowers" "Breaking leaf buds"
## [3] "Colored leaves" "Increasing leaf size"
## [5] "Falling leaves" "Leaves"
phenophaseOfInterest <- "Leaves"
#subset to just the phenosphase of interest
phe_1sp <- filter(phe_1sp, phenophaseName %in% phenophaseOfInterest)
# check that it worked
unique(phe_1sp$phenophaseName)
## [1] "Leaves"
Select only Primary Plots
NEON plant phenology observations are collected along two types of plots.
- Primary plots: an 800 meter square phenology loop transect
- Phenocam plots: a 200 m x 200 m plot located within view of a canopy level, tower-mounted, phenology camera
In the data, these plots are differentiated by the subtypeSpecification
.
Depending on your question you may want to use only one or both of these plot types.
For this activity, we're going to only look at the primary plots.
# what plots are present?
unique(phe_1sp$subtypeSpecification)
## [1] "primary" "phenocam"
# filter
phe_1spPrimary <- filter(phe_1sp, subtypeSpecification == 'primary')
# check that it worked
unique(phe_1spPrimary$subtypeSpecification)
## [1] "primary"
Total in Phenophase of Interest
The phenophaseState
is recorded as "yes" or "no" that the individual is in that
phenophase. The phenophaseIntensity
are categories for how much of the individual
is in that state. For now, we will stick with phenophaseState
.
We can now calculate the total number of individuals with that state. We use
n_distinct(indvidualID)
count the individuals (and not the records) in case
there are duplicate records for an individual.
But later on we'll also want to calculate the percent of the observed individuals in the "leaves" status, therefore, we're also adding in a step here to retain the sample size so that we can calculate % later.
Here we use pipes %>%
from the dpylr package to "pass" objects onto the next
function.
# Calculate sample size for later use
sampSize <- phe_1spPrimary %>%
group_by(dateStat) %>%
summarise(numInd= n_distinct(individualID))
# Total in status by day for distinct individuals
inStat <- phe_1spPrimary%>%
group_by(dateStat, phenophaseStatus)%>%
summarise(countYes=n_distinct(individualID))
## `summarise()` has grouped output by 'dateStat'. You can override using the `.groups` argument.
inStat <- full_join(sampSize, inStat, by="dateStat")
# Retain only Yes
inStat_T <- filter(inStat, phenophaseStatus %in% "yes")
# check that it worked
unique(inStat_T$phenophaseStatus)
## [1] "yes"
Now that we have the data we can plot it.
Plot with ggplot
The ggplot()
function within the ggplot2
package gives us considerable control
over plot appearance. Three basic elements are needed for ggplot()
to work:
- The data_frame: containing the variables that we wish to plot,
-
aes
(aesthetics): which denotes which variables will map to the x-, y- (and other) axes, -
geom_XXXX
(geometry): which defines the data's graphical representation (e.g. points (geom_point
), bars (geom_bar
), lines (geom_line
), etc).
The syntax begins with the base statement that includes the data_frame
(inStat_T
) and associated x (date
) and y (n
) variables to be
plotted:
ggplot(inStat_T, aes(date, n))
Bar Plots with ggplot
To successfully plot, the last piece that is needed is the geom
etry type.
To create a bar plot, we set the geom
element from to geom_bar()
.
The default setting for a ggplot bar plot - geom_bar()
- is a histogram
designated by stat="bin"
. However, in this case, we want to plot count values.
We can use geom_bar(stat="identity")
to force ggplot to plot actual values.
# plot number of individuals in leaf
phenoPlot <- ggplot(inStat_T, aes(dateStat, countYes)) +
geom_bar(stat="identity", na.rm = TRUE)
phenoPlot
# Now let's make the plot look a bit more presentable
phenoPlot <- ggplot(inStat_T, aes(dateStat, countYes)) +
geom_bar(stat="identity", na.rm = TRUE) +
ggtitle("Total Individuals in Leaf") +
xlab("Date") + ylab("Number of Individuals") +
theme(plot.title = element_text(lineheight=.8, face="bold", size = 20)) +
theme(text = element_text(size=18))
phenoPlot
We could also covert this to percentage and plot that.
# convert to percent
inStat_T$percent<- ((inStat_T$countYes)/inStat_T$numInd)*100
# plot percent of leaves
phenoPlot_P <- ggplot(inStat_T, aes(dateStat, percent)) +
geom_bar(stat="identity", na.rm = TRUE) +
ggtitle("Proportion in Leaf") +
xlab("Date") + ylab("% of Individuals") +
theme(plot.title = element_text(lineheight=.8, face="bold", size = 20)) +
theme(text = element_text(size=18))
phenoPlot_P
The plots demonstrate the nice expected pattern of increasing leaf-out, peak, and drop-off.
Drivers of Phenology
Now that we see that there are differences in and shifts in phenophases, what are the drivers of phenophases?
The NEON phenology measurements track sensitive and easily observed indicators of biotic responses to climate variability by monitoring the timing and duration of phenological stages in plant communities. Plant phenology is affected by forces such as temperature, timing and duration of pest infestations and disease outbreaks, water fluxes, nutrient budgets, carbon dynamics, and food availability and has feedbacks to trophic interactions, carbon sequestration, community composition and ecosystem function. (quoted from Plant Phenology Observations user guide.)
Filter by Date
In the next part of this series, we will be exploring temperature as a driver of phenology. Temperature date is quite large (NEON provides this in 1 minute or 30 minute intervals) so let's trim our phenology date down to only one year so that we aren't working with as large a data.
Let's filter to just 2018 data.
# use filter to select only the date of interest
phe_1sp_2018 <- filter(inStat_T, dateStat >= "2018-01-01" & dateStat <= "2018-12-31")
# did it work?
range(phe_1sp_2018$dateStat)
## [1] "2018-04-13 GMT" "2018-11-20 GMT"
How does that look?
# Now let's make the plot look a bit more presentable
phenoPlot18 <- ggplot(phe_1sp_2018, aes(dateStat, countYes)) +
geom_bar(stat="identity", na.rm = TRUE) +
ggtitle("Total Individuals in Leaf") +
xlab("Date") + ylab("Number of Individuals") +
theme(plot.title = element_text(lineheight=.8, face="bold", size = 20)) +
theme(text = element_text(size=18))
phenoPlot18
Now that we've filtered down to just the 2018 data from SCBI for LITU in leaf, we may want to save that subsetted data for another use. To do that you can write the data frame to a .csv file.
You do not need to follow this step if you are continuing on to the next tutorials in this series as you already have the data frame in your environment. Of course if you close R and then come back to it, you will need to re-load this data and instructions for that are provided in the relevant tutorials.
# Write .csv - this step is optional
# This will write to your current working directory, change as desired.
write.csv( phe_1sp_2018 , file="NEONpheno_LITU_Leaves_SCBI_2018.csv", row.names=F)
#If you are using the downloaded example date, this code will write it to the
# pheno data file. Note - this file is already a part of the download.
#write.csv( phe_1sp_2018 , file="NEON-pheno-temp-timeseries_v2/NEONpheno_LITU_Leaves_SCBI_2018.csv", row.names=F)