This tutorial explores working with date and time fields in R. We will review
the differences between as.Date, POSIXct, and POSIXlt as used to convert
a date/time field in character (string) format to a date-time format that is
recognized by R. This conversion supports efficient plotting, subsetting, and
analysis of time series data.
Learning Objectives
After completing this tutorial, you will be able to:
Describe various date-time classes and data structures in R.
Explain the difference between the POSIXct and POSIXlt data classes and
why POSIXct may be preferred for some tasks.
Convert a column containing date-time information in character
format to a date-time R class.
Convert a date-time column to different date-time classes.
Write out a date-time class object in different ways (month-day,
month-day-year, etc).
Things You’ll Need To Complete This Tutorial
You will need the most current version of R and, preferably, RStudio loaded on
your computer to complete this tutorial.
R Script & Challenge Code: NEON data lessons often contain challenges that reinforce
learned skills. If available, the code for challenge solutions is found in the
downloadable R script of the entire lesson, available in the footer of each lesson page.
The Data Approach
In the Intro to Time Series Data in R tutorial,
we imported a time series dataset in .csv format into R. We learned how to
quickly plot these data by converting the date column to an R Date class.
In this tutorial we will explore how to work with a column that contains both a
date AND a time stamp.
We will use functions from both base R and the lubridate package to work
with date-time data classes.
# Load packages required for entire script
library(lubridate) #work with dates
#Set the working directory and place your downloaded data there
wd <- "~/Git/data/"
Import CSV File
First, let's import our time series data. We are interested in temperature,
precipitation and photosynthetically active radiation (PAR) - metrics that are
strongly associated with vegetation green-up and brown-down (phenology or
phenophase timing). We will use the hf001-10-15min-m.csv file
that contains atmospheric data for the NEON Harvard Forest field site,
aggregated at 15-minute intervals. Download the dataset for these exercises here.
# Load csv file of 15 min meteorological data from Harvard Forest
# contained within the downloaded directory, or available for download
# directly from:
# https://harvardforest.fas.harvard.edu/data/p00/hf001/hf001-10-15min-m.csv
# stringsAsFactors=FALSE so strings (series of letters/words/numerals) remain characters
harMet_15Min <- read.csv(
file=paste0(wd,"NEON-DS-Met-Time-Series/HARV/FisherTower-Met/hf001-10-15min-m.csv"),
stringsAsFactors = FALSE)
Date and Time Data
Let's revisit the data structure of our harMet_15Min object. What is the class
of the date-time column?
# view column data class
class(harMet_15Min$datetime)
## [1] "character"
# view sample data
head(harMet_15Min$datetime)
## [1] "2005-01-01T00:15" "2005-01-01T00:30" "2005-01-01T00:45"
## [4] "2005-01-01T01:00" "2005-01-01T01:15" "2005-01-01T01:30"
Our datetime column is stored as a character class. We need to convert it to a
date-time class. What happens when we use the as.Date method that we learned
about in the
Intro to Time Series Data in R tutorial?
# convert column to date class
dateOnly_HARV <- as.Date(harMet_15Min$datetime)
# view data
head(dateOnly_HARV)
## [1] "2005-01-01" "2005-01-01" "2005-01-01" "2005-01-01" "2005-01-01"
## [6] "2005-01-01"
When we use as.Date, we lose the time stamp.
Explore Date and Time Classes
R - Date Class - as.Date
As we just saw, the Date class doesn't store any time information. When we
use the as.Date() method to convert a date stored as a character class to an R
Date class, it ignores all values after the date string.
# Convert character data to date (no time)
myDate <- as.Date("2015-10-19 10:15")
str(myDate)
## Date[1:1], format: "2015-10-19"
# what happens if the date has text at the end?
myDate2 <- as.Date("2015-10-19Hello")
str(myDate2)
## Date[1:1], format: "2015-10-19"
As we can see above, the as.Date() function converts the characters that it
recognizes as part of a date into a Date class and ignores all other
characters in the string.
R - Date-Time - The POSIX classes
If we have a column containing both date and time we need to use a class that
stores both date AND time. Base R offers two closely related classes for date
and time: POSIXct and POSIXlt.
POSIXct
The as.POSIXct method converts a date-time string into a POSIXct class.
# Convert character data to date and time.
timeDate <- as.POSIXct("2015-10-19 10:15")
str(timeDate)
## POSIXct[1:1], format: "2015-10-19 10:15:00"
timeDate
## [1] "2015-10-19 10:15:00 MDT"
as.POSIXct stores both a date and time with an associated time zone. The
default time zone is the one your computer is set to, which is most often your
local time zone.
POSIXct stores date and time as the number of seconds since 1 January 1970.
Negative numbers are used to store dates prior to 1970.
Thus, the POSIXct format stores each date and time as a single value in units of
seconds. Storing the data this way optimizes use in data.frames and speeds up
computation, processing, and conversion to other formats.
# to see the data in this 'raw' format, i.e., not formatted according to the
# class type to show us a date we recognize, use the `unclass()` function.
unclass(timeDate)
## [1] 1445271300
## attr(,"tzone")
## [1] ""
Here we see the number of seconds from 1 January 1970 to 19 October 2015
(1445271300). Also, we see that a time zone (tzone) is associated with the value in seconds.
**Data Tip:** The `unclass` method in R allows you
to view how a particular R object is stored.
POSIXlt
The POSIXct format is optimized for storage and computation. However, humans
aren't quite as computationally efficient as computers! Also, we often want to
quickly extract some portion of the data (e.g., months). The POSIXlt class
stores date and time information in a format that we are used to seeing (e.g.,
second, min, hour, day of month, month, year, numeric day of year, etc).
When we convert a string to POSIXlt and view it in R, it still looks
similar to the POSIXct format. However, unclass() shows us that the data are
stored differently. The POSIXlt class stores the hour, minute, second, day,
month, and year separately.
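For example (a short sketch added here for completeness; the exact time zone
fields in the output will depend on your computer's settings):
# convert character data to POSIXlt date and time
timeDate_lt <- as.POSIXlt("2015-10-19 10:15")
timeDate_lt
# prints like POSIXct, e.g. "2015-10-19 10:15:00" plus your local time zone
# view the underlying list of components
unclass(timeDate_lt)
# returns a list whose elements for this date-time include:
# $sec = 0, $min = 15, $hour = 10, $mday = 19, $mon = 9, $year = 115,
# plus $wday, $yday, $isdst and time zone attributes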
Months in POSIXlt
POSIXlt has a few quirks. First, the month values stored in the POSIXlt
object use zero-based indexing. This means that month #1 (January) is stored
as 0, and month #2 (February) is stored as 1. Notice in the output above,
October is stored as the 9th month ($mon = 9).
Years in POSIXlt
Years are also stored differently in the POSIXlt class. Year values are stored
using a base index value of 1900. Thus, 2015 is stored as 115 ($year = 115;
that is, 115 years since 1900).
**Data Tip:** To learn more about how R stores
information within date-time and other objects check out the R documentation
by using `?` (e.g., `?POSIXlt`). NOTE: you can use this same syntax to learn
about particular functions (e.g., `?ggplot`).
Having dates classified as separate components makes POSIXlt computationally
more resource intensive to use in data.frames. This slows things down! We will
thus use POSIXct for this tutorial.
**Data Tip:** There are other R packages that
support date-time data classes, including `readr`, `zoo` and `chron`.
Convert to Date-time Class
When we convert from a character to a date-time class we need to tell R how
the date and time information are stored in each string. To do this, we can use
format=. Let's have a look at one of our date-time strings to determine its
format.
# view one date-time field
harMet_15Min$datetime[1]
## [1] "2005-01-01T00:15"
Looking at the results above, we see that our data are stored in the format:
Year-Month-Day "T" Hour:Minute (2005-01-01T00:15). We can use this information
to populate our format string using the following designations for the
components of the date-time data:
%Y - year
%m - month
%d - day
%H:%M - hours:minutes
**Data Tip:** A list of options for date-time format
is available in the `strptime` function help: `help(strptime)`. Check it out!
The "T" inserted into the middle of datetime isn't a standard character for
date and time, however we can tell R where the character is so it can ignore
it and interpret the correct date and time as follows:
format="%Y-%m-%dT%H:%M".
# convert single instance of date/time in format year-month-day hour:min:sec
as.POSIXct(harMet_15Min$datetime[1],format="%Y-%m-%dT%H:%M")
## [1] "2005-01-01 00:15:00 MST"
# The format of date-time MUST match the specified format or the data will not
# convert; see what happens when you try it a different way or without the "T"
# specified
as.POSIXct(harMet_15Min$datetime[1],format="%d-%m-%Y%H:%M")
## [1] NA
as.POSIXct(harMet_15Min$datetime[1],format="%Y-%m-%d%H:%M")
## [1] NA
Using the syntax we've learned, we can convert the entire datetime column into
POSIXct class.
new.date.time <- as.POSIXct(harMet_15Min$datetime,
format="%Y-%m-%dT%H:%M" #format time
)
# view output
head(new.date.time)
## [1] "2005-01-01 00:15:00 MST" "2005-01-01 00:30:00 MST"
## [3] "2005-01-01 00:45:00 MST" "2005-01-01 01:00:00 MST"
## [5] "2005-01-01 01:15:00 MST" "2005-01-01 01:30:00 MST"
# what class is the output
class(new.date.time)
## [1] "POSIXct" "POSIXt"
About Time Zones
Above, we successfully converted our data into a date-time class. However, what
time zone is the new.date.time object that we created using? Its values display
like this:
2005-04-15 03:30:00 MDT
It appears as if our data were assigned Mountain Time (MST/MDT - the time zone
of the computer where these tutorials were originally processed, in Colorado).
However, we know from the metadata, explored in the
Why Metadata Are Important: How to Work with Metadata in Text & EML Format tutorial,
that these data were stored in Eastern Standard Time.
Assign Time Zone
When we convert a date-time formatted column to the POSIXct class, we need to
assign an associated time zone. If we don't assign a time zone, R will
default to the local time zone that is defined on your computer.
There are individual designations for different time zones and time zone
variants (e.g., whether the time occurs during daylight saving time).
**Data Tip:** Codes for time zones can be found in this
time zone table.
The code for the Eastern time zone that is the closest match to the NEON Harvard
Forest field site is America/New_York. Let's convert our datetime field
one more time, and define the associated timezone (tz=).
# assign time zone to just the first entry
as.POSIXct(harMet_15Min$datetime[1],
format = "%Y-%m-%dT%H:%M",
tz = "America/New_York")
## [1] "2005-01-01 00:15:00 EST"
The output above shows us that the time zone is now correctly set as EST.
Convert to Date-time Data Class
Now, using the syntax that we learned above, we can convert the entire
datetime column to a POSIXct class.
# convert to POSIXct date-time class
harMet_15Min$datetime <- as.POSIXct(harMet_15Min$datetime,
format = "%Y-%m-%dT%H:%M",
tz = "America/New_York")
# view structure and time zone of the newly defined datetime column
str(harMet_15Min$datetime)
## POSIXct[1:376800], format: "2005-01-01 00:15:00" "2005-01-01 00:30:00" ...
tz(harMet_15Min$datetime)
## [1] "America/New_York"
Now that our datetime data are properly identified as a POSIXct date-time
data class we can continue on and look at the patterns of precipitation,
temperature, and PAR through time.
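As a quick, hedged preview (not part of the original lesson code), we could plot
the 15-minute precipitation through time with qplot(); this assumes the
15-minute file stores precipitation in a column named prec, as the daily Harvard
Forest files do.
# load ggplot2 for qplot()
library(ggplot2)
# plot 15-minute precipitation through time
qplot(x = datetime, y = prec,
      data = harMet_15Min,
      main = "15-Minute Precipitation\nNEON Harvard Forest Field Site")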
This tutorial covers what metadata are, and why we need to work with
metadata. It covers the 3 most common metadata formats: text file format,
web page format and Ecological Metadata Language (EML).
R Skill Level: Introduction - you've got the basics of R down and
understand the general structure of tabular data.
Learning Objectives
After completing this tutorial, you will be able to:
Import a .csv file and examine the structure of the related R
object.
Use a metadata file to better understand the content of a dataset.
Explain the importance of including metadata details in your R script.
Describe what an EML file is.
Things You’ll Need To Complete This Tutorial
You will need the most current version of R and, preferably, RStudio loaded on
your computer to complete this tutorial.
Install R Packages
When presented in a workshop, the EML package will be presented as a
demonstration!
If completing the EML portion of the tutorial on your own, you must
install EML directly from GitHub (the package is in development and is not
yet on CRAN). You will need to install devtools to do this.
R Script & Challenge Code: NEON data lessons often contain challenges that reinforce
learned skills. If available, the code for challenge solutions is found in the
downloadable R script of the entire lesson, available in the footer of each lesson page.
Understand Our Data
In order to work with any data, we need to understand three things about the
data:
Structure
Format
Processing methods
If the data are collected by other people and organizations, we might also need
further information about:
What metrics are included in the dataset
The units those metrics are stored in
An explanation of where the metrics are stored in the data and what they are "called"
(e.g., what the column names are in a spreadsheet)
The time range the data cover
The spatial extent the data cover
The above information, and more, is stored in metadata - data
about the data. Metadata is information that describes a dataset and is integral
to working with external data (data that we did not collect ourselves).
Metadata Formats
Metadata come in different formats. We will discuss three of those in this tutorial:
Ecological Metadata Language (EML): A standardized metadata format stored
as XML, which is machine readable. Although EML is a standard, it is
common to encounter metadata stored differently in EML files created by different
organizations.
Text file: Sometimes metadata files can be found in text files that are either
downloaded with a data product OR that are available separately for the data.
Directly on a website (HTML / XML): Sometimes data are documented directly
in text format, on a web page.
**Data Tip:** When you find metadata for a dataset
that you are working with, immediately **DOWNLOAD AND SAVE IT** to the directory
on your computer where you saved the data. It is also a good idea to document
the URL where you found the metadata and data in a "readme" text file!
Metadata Stored on a Web Page
The metadata for the data that we are working with for the Harvard Forest field
site are stored in both EML format and on a web page. Let's explore the web
page first.
Let's begin by visiting the Harvard Forest Data Archive page for the Fisher
Meteorological Station. At the top of the page, there is a list
of data available for Harvard Forest. NOTE: hf001-06: daily (metric) since
2001 (preview) is the data that we used in the
previous tutorial.
Scroll down to the Overview section on the website. Take note of the
information provided in that section and answer the questions in the
Challenge below.
### Challenge: Explore Metadata
Explore the metadata stored on the Harvard Forest LTER web page. Answer the
following questions.
What is the time span of the data available for this site?
You have some questions about these data. Who is the lead investigator / who
do you contact for more information? And how do you contact them?
Where is this field site located? How is the site location information stored
in the metadata? Is there any information that may be useful to you viewing the
location? HINT: What if you were not familiar with Harvard as a site / from
another country, etc?
Field Site Information: What is the elevation for the site? What is the
dominant vegetation cover for the site? HINT: Is dominant vegetation easy to
find in the metadata?
How is snow dealt with in the precipitation data?
Are there some metadata attributes that might be useful to access in a script
in R or Python rather than viewed on a web page? HINT: Can you answer all of
the questions above from the information provided on this website? Is there
additional information that you might want to find on the page?
View Metadata For Metrics of Interest
For this tutorial series, we are interested in the drivers of plant phenology -
specifically air and soil temperature, precipitation and photosynthetically
active radiation (PAR). Let's look for descriptions of these variables in the
metadata and determine several key attributes that we will need prior to working
with the data.
### Challenge: Metrics of Interest Metadata
View the metadata at the URL above. Answer the following questions about the
Harvard Forest LTER data - hf001-10: 15-minute (metric) since 2005:
What is the column heading name where each variable (air temperature, soil
temperature, precipitation and PAR) is stored?
What are the units that each variable are stored in?
What is the frequency of measurement collected for each and how are noData
values stored?
Where is the date information stored (in what field) and what timezone are
the dates stored in?
Why Metadata on a Web Page Is Not Ideal
It is nice to have a web page that displays metadata information; however,
accessing metadata on a web page has drawbacks:
If the web page URL changes or the site goes down, the information is lost.
It's also more challenging to access metadata in text format on a web page
programmatically - for example, using R as an interface - which we often
want to do when working with larger datasets.
A machine readable metadata file is better - especially when we are working with
large data and we want to automate and carefully document workflows. The
Ecological Metadata Language (EML) is one machine readable metadata format.
Ecological Metadata Language (EML)
While much of the documentation that we need to work with at the Harvard Forest
field site is available directly on the
Harvard Forest Data Archive page,
the website also offers metadata in EML format.
Introduction to EML
The Ecological Metadata Language (EML) is a data specification developed
specifically
to document ecological data. An EML file is created using an XML-based format.
This means that content is embedded within hierarchical tags. For example,
the title of a dataset might be embedded in a <title> tag as follows:
<title>Fisher Meteorological Station at Harvard Forest since 2001</title>
Similarly, the creator of a dataset can also be found in a hierarchical tag
structure.
The EML package for R is designed to read and allow users to work with EML
formatted metadata. In this tutorial, we demonstrate how we can use EML in an
automated workflow.
NOTE: The EML package is still being developed, therefore we will not
explicitly teach all details of how to use it. Instead, we will provide
an example of how you can access EML files programmatically and background
information so that you can further explore EML and the EML package if you
need to work with it further.
EML Terminology
Let's first discuss some basic EML terminology. In the
context of EML, each EML file documents a dataset. This dataset may
consist of one or more files that contain the data in data tables. In the
case of our tabular meteorology data, the structure of our EML file includes:
The dataset. A dataset may contain
one or more data tables associated with it that may contain different types of
related information. For this Harvard Forest meteorological data, the data
tables contain tower measurements including precipitation and temperature, that
are aggregated at various time intervals (15 minute, daily, etc) and that date
back to 2001.
The data tables. Data tables refer to the actual data that make up the
dataset. For the Harvard Forest data, each data table contains a suite of
meteorological metrics, including precipitation and temperature (and associated
quality flags), that are aggregated at a particular time interval (e.g. one
data table contains monthly average data, another contains 15 minute
averaged data, etc).
Work With EML in R
To begin, we will load the EML package, which can be installed directly from ROpenSci's GitHub repository.
# install R EML tool
# load devtools (if you need to install "EML")
#library("devtools")
# IF YOU HAVE NOT DONE SO ALREADY: install EML from github -- package in
# development; not on CRAN
#install_github("ropensci/EML", build=FALSE, dependencies=c("DEPENDS", "IMPORTS"))
# load ROpenSci EML package
library(EML)
# load ggmap for mapping
library(ggmap)
# load tmaptools for mapping
library(tmaptools)
Next, we will read in the LTER EML file - directly from the online URL - using
read_eml(). This file documents multiple data products that can be downloaded.
Check out the
Harvard Forest Data Archive Page for Fisher Meteorological Station
for more on this dataset and to download the archive files directly.
Note that because this EML file is large, it takes a few seconds for the file to
load.
# data location
# http://harvardforest.fas.harvard.edu:8080/exist/apps/datasets/showData.html?id=hf001
# table 4 http://harvardforest.fas.harvard.edu/data/p00/hf001/hf001-04-monthly-m.csv
# import EML from Harvard Forest Met Data
# note: for this tutorial we read the full EML file directly from the Harvard Forest
# website; a truncated version is also available (see the commented-out line below)
eml_HARV <- read_eml("https://harvardforest1.fas.harvard.edu/sites/harvardforest.fas.harvard.edu/files/data/eml/hf001.xml")
# import a truncated version of the eml file for quicker demonstration
# eml_HARV <- read_eml("http://neonscience.github.io/NEON-R-Tabular-Time-Series/hf001-revised.xml")
# view size of object
object.size(eml_HARV)
## 1299568 bytes
# view the object class
class(eml_HARV)
## [1] "emld" "list"
The read_eml() function creates an emld object (a list structure). Its elements
can be accessed using the $ operator rather than a typical subset [ ] approach.
Explore Metadata Attributes
We can begin to explore the contents of our EML file and associated data that it
describes. Let's start at the dataset level. We can use slots to view
the contact information for the dataset and a description of the methods.
# view the contact name listed in the file
eml_HARV$dataset$creator
## $individualName
## $individualName$givenName
## [1] "Emery"
##
## $individualName$surName
## [1] "Boose"
# view information about the methods used to collect the data as described in EML
eml_HARV$dataset$methods
## $methodStep
## $methodStep$description
## $methodStep$description$section
## $methodStep$description$section[[1]]
## [1] "<title>Observation periods</title><para>15-minute: 15 minutes, ending with given time. Hourly: 1 hour, ending with given time. Daily: 1 day, midnight to midnight. All times are Eastern Standard Time.</para>"
##
## $methodStep$description$section[[2]]
## [1] "<title>Instruments</title><para>Air temperature and relative humidity: Vaisala HMP45C (2.2m above ground). Precipitation: Met One 385 heated rain gage (top of gage 1.6m above ground). Global solar radiation: Licor LI200X pyranometer (2.7m above ground). PAR radiation: Licor LI190SB quantum sensor (2.7m above ground). Net radiation: Kipp and Zonen NR-LITE net radiometer (5.0m above ground). Barometric pressure: Vaisala CS105 barometer. Wind speed and direction: R.M. Young 05103 wind monitor (10m above ground). Soil temperature: Campbell 107 temperature probe (10cm below ground). Data logger: Campbell Scientific CR10X.</para>"
##
## $methodStep$description$section[[3]]
## [1] "<title>Instrument and flag notes</title><para>Air temperature. Daily air temperature is estimated from other stations as needed to complete record.</para><para>Precipitation. Daily precipitation is estimated from other stations as needed to complete record. Delayed melting of snow and ice (caused by problems with rain gage heater or heavy precipitation) is noted in log - daily values are corrected if necessary but 15-minute values are not. The gage may underestimate actual precipitation under windy or cold conditions.</para><para>Radiation. Whenever possible, snow and ice are removed from radiation instruments after precipitation events. Depth of snow or ice on instruments and time of removal are noted in log, but values are not corrected or flagged.</para><para>Wind speed and direction. During ice storms, values are flagged as questionable when there is evidence (from direct observation or the 15-minute record) that ice accumulation may have affected the instrument's operation.</para>"
Identify & Map Data Location
Looking at the coverage for our data, there is only one unique x and y value.
This suggests that our data were collected at a single (x,y) point location. We know
this is a tower so a point location makes sense. Let's grab the x and y
coordinates and create a quick context map. We will use ggmap to create our
map.
NOTE: If this were a rectangular extent, we'd want the bounding box not just
the point. This is important if the data are in raster, HDF5, or a similar format.
We need the extent to properly geolocate and process the data.
# grab x coordinate from the coverage information
XCoord <- eml_HARV$dataset$coverage$geographicCoverage$boundingCoordinates$westBoundingCoordinate[[1]]
# grab y coordinate from the coverage information
YCoord <- eml_HARV$dataset$coverage$geographicCoverage$boundingCoordinates$northBoundingCoordinate[[1]]
# make a map and add the xy coordinates to it
ggmap(get_stamenmap(rbind(as.numeric(paste(geocode_OSM("Massachusetts")$bbox))),
                    zoom = 11, maptype = c("toner")),
      extent = TRUE) +
  geom_point(aes(x = as.numeric(XCoord), y = as.numeric(YCoord)),
             color = "darkred", size = 6, pch = 18)
The above example demonstrated how we can extract information from an EML
document and use it programmatically in R! This is just the beginning of what
we can do!
Metadata For Your Own Data
Now, imagine that you are working with someone else's data and you don't have a
metadata file associated with it. How do you know what units the data are in?
How the data were collected? What location the data cover? It is equally
important to create metadata for your own data, to make your data more
easily shareable.
This tutorial will demonstrate how to import a time series dataset stored in .csv
format into R. It will explore data classes for columns in a data.frame and
will walk through how to
convert a date, stored as a character string, into a date class that R can
recognize and plot efficiently.
Learning Objectives
After completing this tutorial, you will be able to:
Open a .csv file in R using read.csv() and understand why we
are using that file type.
Work with data stored in different columns within a data.frame in R.
Examine R object structures and data classes.
Convert dates, stored as a character class, into an R date
class.
Create a quick plot of a time-series dataset using qplot.
Things You’ll Need To Complete This Tutorial
You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.
R Script & Challenge Code: NEON data lessons often contain challenges that reinforce
learned skills. If available, the code for challenge solutions is found in the
downloadable R script of the entire lesson, available in the footer of each lesson page.
Data Related to Phenology
In this tutorial, we will explore atmospheric data (including temperature,
precipitation and other metrics) collected by sensors mounted on a
flux tower
at the NEON Harvard Forest field site. We are interested in exploring
changes in temperature, precipitation, Photosynthetically Active Radiation (PAR) and day
length throughout the year -- metrics that impact changes in the timing of plant
phenophases (phenology).
About .csv Format
The data that we will use are in .csv (comma-separated values) file format. The
.csv format is a plain text format in which each value in the dataset is
separated by a comma and each "row" in the dataset is separated by a line break.
Plain text formats work well across platforms (Mac, PC, Linux,
etc.) and can be read by many different tools. The plain text
format is also less likely to become obsolete over time.
To begin, let's import the data into R. We can use base R functionality
to import a .csv file. We will use the ggplot2 package to plot our data.
# Load packages required for entire script.
# library(PackageName) # purpose of package
library(ggplot2) # efficient, pretty plotting - required for qplot function
# set working directory to ensure R can find the file we wish to import
# provide the location for where you've unzipped the lesson data
wd <- "~/Git/data/"
**Data Tip:** Good coding practice -- install and
load all libraries at top of script.
If you decide you need another package later on in the script, return to this
area and add it. That way, with a glance, you can see all packages used in a
given script.
Once our working directory is set, we can import the file using read.csv().
# Load csv file of daily meteorological data from Harvard Forest
harMet.daily <- read.csv(
file=paste0(wd,"NEON-DS-Met-Time-Series/HARV/FisherTower-Met/hf001-06-daily-m.csv"),
stringsAsFactors = FALSE
)
stringsAsFactors=FALSE
When reading in files we most often use stringsAsFactors = FALSE. This
setting ensures that non-numeric data (strings) are not converted to
factors.
What Is A Factor?
A factor is similar to a category. However, factors can be numerically interpreted
(they can have an order) and have levels associated with them.
Examples of factors:
Month Names (an ordinal variable): Month names are non-numerical but we know
that April (month 4) comes after March (month 3) and each could be represented
by a number (4 & 3).
1 and 2s to represent male and female sex (a nominal variable): Numerical
interpretation of non-numerical data but no order to the levels.
After loading the data it is easy to convert any field that should be a factor by
using as.factor(). Therefore it is often best to read in a file with
stringsAsFactors = FALSE.
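For example (a brief illustrative sketch, not taken from the lesson data):
# character vector of month names
month_chr <- c("March", "April", "March")
class(month_chr)
## [1] "character"
# convert to a factor after import
month_fac <- as.factor(month_chr)
class(month_fac)
## [1] "factor"
# factor levels default to alphabetical order
levels(month_fac)
## [1] "April" "March"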
Data.Frames in R
The read.csv() function imports our .csv file into a data.frame object in R. data.frames
are ideal for working with tabular data - they are similar to a spreadsheet.
# what type of R object is our imported data?
class(harMet.daily)
## [1] "data.frame"
Data Structure
Once the data are imported, we can explore their structure. There are several
ways to examine the structure of a data frame:
head(): shows us the first 6 rows of the data (tail() shows the last 6
rows).
str(): displays the structure of the data as R interprets it.
Let's use both to explore our data.
# view first 6 rows of the dataframe
head(harMet.daily)
## date jd airt f.airt airtmax f.airtmax airtmin f.airtmin rh
## 1 2001-02-11 42 -10.7 -6.9 -15.1 40
## 2 2001-02-12 43 -9.8 -2.4 -17.4 45
## 3 2001-02-13 44 -2.0 5.7 -7.3 70
## 4 2001-02-14 45 -0.5 1.9 -5.7 78
## 5 2001-02-15 46 -0.4 2.4 -5.7 69
## 6 2001-02-16 47 -3.0 1.3 -9.0 82
## f.rh rhmax f.rhmax rhmin f.rhmin dewp f.dewp dewpmax f.dewpmax
## 1 58 22 -22.2 -16.8
## 2 85 14 -20.7 -9.2
## 3 100 34 -7.6 -4.6
## 4 100 59 -4.1 1.9
## 5 100 37 -6.0 2.0
## 6 100 46 -5.9 -0.4
## dewpmin f.dewpmin prec f.prec slrt f.slrt part f.part netr f.netr
## 1 -25.7 0.0 14.9 NA M NA M
## 2 -27.9 0.0 14.8 NA M NA M
## 3 -10.2 0.0 14.8 NA M NA M
## 4 -10.2 6.9 2.6 NA M NA M
## 5 -12.1 0.0 10.5 NA M NA M
## 6 -10.6 2.3 6.4 NA M NA M
## bar f.bar wspd f.wspd wres f.wres wdir f.wdir wdev f.wdev gspd
## 1 1025 3.3 2.9 287 27 15.4
## 2 1033 1.7 0.9 245 55 7.2
## 3 1024 1.7 0.9 278 53 9.6
## 4 1016 2.5 1.9 197 38 11.2
## 5 1010 1.6 1.2 300 40 12.7
## 6 1016 1.1 0.5 182 56 5.8
## f.gspd s10t f.s10t s10tmax f.s10tmax s10tmin f.s10tmin
## 1 NA M NA M NA M
## 2 NA M NA M NA M
## 3 NA M NA M NA M
## 4 NA M NA M NA M
## 5 NA M NA M NA M
## 6 NA M NA M NA M
# View the structure (str) of the data
str(harMet.daily)
## 'data.frame': 5345 obs. of 46 variables:
## $ date : chr "2001-02-11" "2001-02-12" "2001-02-13" "2001-02-14" ...
## $ jd : int 42 43 44 45 46 47 48 49 50 51 ...
## $ airt : num -10.7 -9.8 -2 -0.5 -0.4 -3 -4.5 -9.9 -4.5 3.2 ...
## $ f.airt : chr "" "" "" "" ...
## $ airtmax : num -6.9 -2.4 5.7 1.9 2.4 1.3 -0.7 -3.3 0.7 8.9 ...
## $ f.airtmax: chr "" "" "" "" ...
## $ airtmin : num -15.1 -17.4 -7.3 -5.7 -5.7 -9 -12.7 -17.1 -11.7 -1.3 ...
## $ f.airtmin: chr "" "" "" "" ...
## $ rh : int 40 45 70 78 69 82 66 51 57 62 ...
## $ f.rh : chr "" "" "" "" ...
## $ rhmax : int 58 85 100 100 100 100 100 71 81 78 ...
## $ f.rhmax : chr "" "" "" "" ...
## $ rhmin : int 22 14 34 59 37 46 30 34 37 42 ...
## $ f.rhmin : chr "" "" "" "" ...
## $ dewp : num -22.2 -20.7 -7.6 -4.1 -6 -5.9 -10.8 -18.5 -12 -3.5 ...
## $ f.dewp : chr "" "" "" "" ...
## $ dewpmax : num -16.8 -9.2 -4.6 1.9 2 -0.4 -0.7 -14.4 -4 0.6 ...
## $ f.dewpmax: chr "" "" "" "" ...
## $ dewpmin : num -25.7 -27.9 -10.2 -10.2 -12.1 -10.6 -25.4 -25 -16.5 -5.7 ...
## $ f.dewpmin: chr "" "" "" "" ...
## $ prec : num 0 0 0 6.9 0 2.3 0 0 0 0 ...
## $ f.prec : chr "" "" "" "" ...
## $ slrt : num 14.9 14.8 14.8 2.6 10.5 6.4 10.3 15.5 15 7.7 ...
## $ f.slrt : chr "" "" "" "" ...
## $ part : num NA NA NA NA NA NA NA NA NA NA ...
## $ f.part : chr "M" "M" "M" "M" ...
## $ netr : num NA NA NA NA NA NA NA NA NA NA ...
## $ f.netr : chr "M" "M" "M" "M" ...
## $ bar : int 1025 1033 1024 1016 1010 1016 1008 1022 1022 1017 ...
## $ f.bar : chr "" "" "" "" ...
## $ wspd : num 3.3 1.7 1.7 2.5 1.6 1.1 3.3 2 2.5 2 ...
## $ f.wspd : chr "" "" "" "" ...
## $ wres : num 2.9 0.9 0.9 1.9 1.2 0.5 3 1.9 2.1 1.8 ...
## $ f.wres : chr "" "" "" "" ...
## $ wdir : int 287 245 278 197 300 182 281 272 217 218 ...
## $ f.wdir : chr "" "" "" "" ...
## $ wdev : int 27 55 53 38 40 56 24 24 31 27 ...
## $ f.wdev : chr "" "" "" "" ...
## $ gspd : num 15.4 7.2 9.6 11.2 12.7 5.8 16.9 10.3 11.1 10.9 ...
## $ f.gspd : chr "" "" "" "" ...
## $ s10t : num NA NA NA NA NA NA NA NA NA NA ...
## $ f.s10t : chr "M" "M" "M" "M" ...
## $ s10tmax : num NA NA NA NA NA NA NA NA NA NA ...
## $ f.s10tmax: chr "M" "M" "M" "M" ...
## $ s10tmin : num NA NA NA NA NA NA NA NA NA NA ...
## $ f.s10tmin: chr "M" "M" "M" "M" ...
**Data Tip:** You can adjust the number of rows
returned when using the `head()` and `tail()` functions. For example you can use
`head(harMet.daily, 10)` to display the first 10 rows of your data rather than 6.
Classes in R
The structure results above let us know that the attributes in our data.frame
are stored as several different data types or classes as follows:
chr - Character: holds strings composed of letters and
words. Character class data cannot be interpreted numerically - that is to say,
we cannot perform math on these values even if they contain only numbers.
int - Integer: holds whole numbers without decimals.
Mathematical operations can be performed on integers.
num - Numeric: accepts data in a wide variety of numeric formats,
including decimals (floating point values) and integers. Numeric also accepts
larger numbers than int will.
Storing variables using different classes is a strategic decision by R (and
other programming languages) that optimizes processing and storage. It allows:
data to be processed more quickly & efficiently.
the program (R) to minimize the storage size.
Differences Between Classes
Certain functions can be performed on certain data classes and not on others.
For example:
a <- "mouse"
b <- "sparrow"
class(a)
## [1] "character"
class(b)
## [1] "character"
# subtract a-b
a-b
## Error in a - b: non-numeric argument to binary operator
You cannot subtract two character values because they are not numbers.
Additionally, performing summary statistics and other calculations of different
types of classes can yield different results.
# create a new object
speciesObserved <- c("speciesb","speciesc","speciesa")
speciesObserved
## [1] "speciesb" "speciesc" "speciesa"
# determine the class
class(speciesObserved)
## [1] "character"
# calculate the minimum
min(speciesObserved)
## [1] "speciesa"
# create numeric object
prec <- c(1,2,5,3,6)
# view class
class(prec)
## [1] "numeric"
# calculate min value
min(prec)
## [1] 1
We can calculate the minimum value for speciesObserved, a character
data class; however, it does not return a quantitative minimum. It simply
returns the first element using alphabetical (rather than numeric) order.
Yet, we can calculate the quantitative minimum value for prec, a numeric
data class.
Plot Data Using qplot()
Now that we've got classes down, let's plot one of the metrics in our data,
air temperature -- airt. Given this is a time series dataset, we want to plot
air temperature as it changes over time. We have a date column, date, so
let's use that as our x-axis variable and airt as our y-axis variable.
We will use the qplot() (for quick plot) function in the ggplot2 package.
The syntax for qplot() requires the x- and y-axis variables and then the R
object that the variables are stored in.
**Data Tip:** Add a title to the plot using
`main="Title string"`.
# quickly plot air temperature
qplot(x=date, y=airt,
data=harMet.daily,
main="Daily Air Temperature\nNEON Harvard Forest Field Site")
We have successfully plotted some data. However, what is happening on the
x-axis?
R is trying to plot EVERY date value in our data, on the x-axis. This makes it
hard to read. Why? Let's have a look at the class of the x-axis variable - date.
# View data class for each column that we wish to plot
class(harMet.daily$date)
## [1] "character"
class(harMet.daily$airt)
## [1] "numeric"
In this case, the date column is stored in our data.frame as a character
class. Because it is a character, R does not know how to plot the dates as a
continuous variable. Instead it tries to plot every date value as a text string.
The airt data class is numeric so that metric plots just fine.
Date as a Date-Time Class
We need to convert our date column, which is currently stored as a character,
to a date class that can be displayed as a continuous variable. Lucky
for us, R has a Date class. We can convert the date field to a Date class
using as.Date().
# convert column to date class
harMet.daily$date <- as.Date(harMet.daily$date)
# view R class of data
class(harMet.daily$date)
## [1] "Date"
# view results
head(harMet.daily$date)
## [1] "2001-02-11" "2001-02-12" "2001-02-13" "2001-02-14" "2001-02-15"
## [6] "2001-02-16"
Now that we have adjusted the date, let's plot again. Notice that it plots
much more quickly now that R recognizes date as a date class. R can
aggregate ticks on the x-axis by year instead of trying to plot every day!
# quickly plot the data and include a title using main=""
# In title string we can use '\n' to force the string to break onto a new line
qplot(x=date,y=airt,
data=harMet.daily,
main="Daily Air Temperature w/ Date Assigned\nNEON Harvard Forest Field Site")
### Challenge: Using ggplot2's qplot function
Create a quick plot of the precipitation. Use the full time frame of data available
in the harMet.daily object.
Do precipitation and air temperature have similar annual patterns?
Create a quick plot examining the relationship between air temperature and precipitation.
Hint: you can modify the X and Y axis labels using xlab="label text" and
ylab="label text".
This tutorial defines Julian (year) day as most often used in an ecological
context, explains why Julian days are useful for analysis and plotting, and
teaches how to create a Julian day variable from a Date or Date/Time class
variable.
Learning Objectives
After completing this tutorial, you will be able to:
Define a Julian day (year day) as used in most ecological
contexts.
Convert a Date or Date/Time class variable to a Julian day
variable.
Things You’ll Need To Complete This Tutorial
You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.
R Script & Challenge Code: NEON data lessons often contain challenges that reinforce
learned skills. If available, the code for challenge solutions is found in the
downloadable R script of the entire lesson, available in the footer of each lesson page.
Convert Between Time Formats - Julian Days
Julian day, as most often used in an ecological context, is a continuous count
of the number of days beginning at January 1 each year. Thus, each year will
have up to 365 (non-leap year) or 366 (leap year) days.
**Data Note:** This format can also be called ordinal
day or year day. In some contexts, Julian day can refer specifically to a
numeric day count since 1 January 4713 BCE or as a count from some other origin
day, instead of an annual count of 365 or 366 days.
Including a Julian day variable in your dataset can be very useful when
comparing data across years, when plotting data, and when matching your data
with other types of data that include Julian day.
Load the Data
Load this dataset that we will use to convert a date into a year day or Julian
day.
Notice the date is read in as a character and must first be converted to a Date
class.
# Load packages required for entire script
library(lubridate) #work with dates
# set working directory to ensure R can find the file we wish to import
wd <- "~/Git/data/"
# Load csv file of daily meteorological data from Harvard Forest
# stringsAsFactors=FALSE so strings (series of letters/words/numerals) remain characters
harMet_DailyNoJD <- read.csv(
file=paste0(wd,"NEON-DS-Met-Time-Series/HARV/FisherTower-Met/hf001-06-daily-m-NoJD.csv"),
stringsAsFactors = FALSE
)
# what data class is the date column?
str(harMet_DailyNoJD$date)
## chr [1:5345] "2/11/01" "2/12/01" "2/13/01" "2/14/01" "2/15/01" ...
# convert "date" from chr to a Date class and specify current date format
harMet_DailyNoJD$date<- as.Date(harMet_DailyNoJD$date, "%m/%d/%y")
Convert with yday()
To quickly convert from Date to Julian day, we can use yday(), a
function from the lubridate package.
# to learn more about it type
?yday
We want to create a new column in the existing data frame, titled julian, that
contains the Julian day data.
# convert with yday into a new column "julian"
harMet_DailyNoJD$julian <- yday(harMet_DailyNoJD$date)
# make sure it worked all the way through.
head(harMet_DailyNoJD$julian)
## [1] 42 43 44 45 46 47
tail(harMet_DailyNoJD$julian)
## [1] 268 269 270 271 272 273
**Data Tip:** In this tutorial we converted from
`Date` class to a Julian day, however, you can convert between any recognized
date/time class (POSIXct, POSIXlt, etc) and Julian day using `yday`.
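For example (a small illustrative sketch, not from the lesson data):
# yday() also works directly on a POSIXct date-time value
yday(as.POSIXct("2015-10-19 10:15", tz = "America/New_York"))
## [1] 292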
This tutorial reviews how to plot a raster in R using the plot()
function. It also covers how to layer a raster on top of a hillshade to produce
an elegant map.
Learning Objectives
After completing this tutorial, you will be able to:
Plot a single band raster in R.
Layer a raster dataset on top of a hillshade to create an elegant
basemap.
Things You’ll Need To Complete This Tutorial
You will need the most current version of R and, preferably, RStudio loaded
on your computer to complete this tutorial.
Set Working Directory: This lesson will explain how to set the working directory. You may wish to set your working directory to some other location, depending on how you prefer to organize your data.
In this tutorial, we will plot the Digital Surface Model (DSM) raster
for the NEON Harvard Forest Field Site. We will use the hist() function as a
tool to explore raster values, and we will render classified plots using the
breaks argument to get bins that are meaningful representations of our data.
We will use the terra package in this tutorial. If you do not
have the DSM_HARV variable as defined in the previous tutorial, Intro To Raster In R, please download it using neonUtilities, as shown in that tutorial.
library(terra)
# set working directory
wd <- "~/data/"
setwd(wd)
# import raster into R
dsm_harv_file <- paste0(wd, "DP3.30024.001/neon-aop-products/2022/FullSite/D01/2022_HARV_7/L3/DiscreteLidar/DSMGtif/NEON_D01_HARV_DP3_732000_4713000_DSM.tif")
DSM_HARV <- rast(dsm_harv_file)
First, let's plot our Digital Surface Model object (DSM_HARV) using the
plot() function. We add a title using the argument main="title".
# Plot raster object
plot(DSM_HARV, main="Digital Surface Model - HARV")
Plotting Data Using Breaks
We can view our data "symbolized" or colored according to ranges of values
rather than using a continuous color ramp. This is comparable to a "classified"
map. However, to assign breaks, it is useful to first explore the distribution
of the data using a histogram. The breaks argument in the hist() function
tells R to use fewer or more breaks or bins.
If we name the histogram, we can also view counts for each bin and assigned
break values.
# Plot distribution of raster values
DSMhist<-hist(DSM_HARV,
breaks=3,
main="Histogram Digital Surface Model\n NEON Harvard Forest Field Site",
col="lightblue", # changes bin color
xlab= "Elevation (m)") # label the x-axis
# Where are breaks and how many pixels in each category?
DSMhist$breaks
## [1] 300 350 400 450
DSMhist$counts
## [1] 355269 611685 33046
Looking at our histogram, R has binned out the data as follows:
300-350m, 350-400m, 400-450m
We can determine that most of the pixel values fall in the 350-400m range with a
few pixels falling in the lower and higher range. We could specify different
breaks, if we wished to have a different distribution of pixels in each bin.
We can use those bins to plot our raster data. We will use the
terrain.colors() function to create a palette of 3 colors to use in our plot.
The breaks argument allows us to add breaks. To specify where the breaks
occur, we use the following syntax: breaks=c(value1,value2,value3).
We can include as few or many breaks as we'd like.
# plot using breaks.
plot(DSM_HARV,
breaks = c(300, 350, 400, 450),
col = terrain.colors(3),
main="Digital Surface Model (DSM) - HARV")
Data Tip: Note that when we assign break values
a set of 4 values will result in 3 bins of data.
Format Plot
If we need to create multiple plots using the same color palette, we can create
an R object (myCol) for the set of colors that we want to use. We can then
quickly change the palette across all plots by simply modifying the myCol
object.
We can label the x- and y-axes of our plot too using xlab and ylab.
# Assign color to a object for repeat use/ ease of changing
myCol = terrain.colors(3)
# Add axis labels
plot(DSM_HARV,
breaks = c(300, 350, 400, 450),
col = myCol,
main="Digital Surface Model - HARV",
xlab = "UTM Easting (m)",
ylab = "UTM Northing (m)")
Or we can also turn off the axes altogether.
# or we can turn off the axis altogether
plot(DSM_HARV,
breaks = c(300, 350, 400, 450),
col = myCol,
main="Digital Surface Model - HARV",
axes=FALSE)
Challenge: Plot Using Custom Breaks
Create a plot of the Harvard Forest Digital Surface Model (DSM) that has:
Six classified ranges of values (break points) that are evenly divided among
the range of pixel values.
Axis labels
A plot title
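One possible approach (a sketch, not the official solution; it assumes the
DSM_HARV object defined above and uses seq() to create six evenly spaced bins
between the raster's minimum and maximum values):
# find the range of cell values, then create 7 evenly spaced break points (6 bins)
DSMrange <- range(values(DSM_HARV), na.rm = TRUE)
DSMbreaks <- seq(DSMrange[1], DSMrange[2], length.out = 7)
# plot with six classified ranges, axis labels, and a title
plot(DSM_HARV,
     breaks = DSMbreaks,
     col = terrain.colors(6),
     main = "Digital Surface Model - HARV, 6 Classes",
     xlab = "UTM Easting (m)",
     ylab = "UTM Northing (m)")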
Hillshade & Layering Rasters
The terra package has built-in functions called terrain for calculating
slope and aspect, and shade for computing hillshade from the slope and aspect.
A hillshade is a raster that maps the shadows and texture that you would see
from above when viewing terrain.
The alpha value determines how transparent the colors will be (0 being
transparent, 1 being opaque). You can also change the color palette, for example,
use rainbow() instead of terrain.colors().
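Below is a minimal sketch of that layering workflow (the sun angle, direction,
alpha, and color choices here are illustrative, not values from the original
lesson):
# compute slope and aspect in radians, as required by shade()
slope  <- terrain(DSM_HARV, v = "slope",  unit = "radians")
aspect <- terrain(DSM_HARV, v = "aspect", unit = "radians")
# compute the hillshade from slope and aspect
hill <- shade(slope, aspect, angle = 40, direction = 270)
# plot the hillshade in greyscale, then overlay the DSM with transparency
plot(hill, col = grey(0:100/100), legend = FALSE,
     main = "DSM with Hillshade - HARV")
plot(DSM_HARV, col = terrain.colors(25), alpha = 0.5, add = TRUE, legend = FALSE)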
For a full tutorial on rasters & raster data, please go through the
Intro to Raster Data in R tutorial
which provides foundational concepts even if you aren't working with R.
A raster is a dataset made up of cells or pixels. Each pixel represents a value
associated with a region on the earth’s surface.
The spatial resolution of a raster refers to the size of each cell
in meters. This size in turn relates to the area on the ground that the pixel
represents. Source: National Ecological Observatory Network
There are several ways that we can get from the point data collected by lidar to
the surface data that we want for different Digital Elevation Models or similar
data we use in analyses and mapping.
Basic gridding does not allow for the recovery/classification of data in any area
where data are missing. Interpolation (including Triangulated Irregular Network
(TIN) Interpolation) allows for gaps to be covered so that there are not holes
in the resulting raster surface.
Interpolation can be done in a number of different ways, some of which are
deterministic and some are probabilistic.
When converting a set of sample points to a grid, there are many
different approaches that should be considered. Source: National Ecological
Observatory Network
Gridding Points
When creating a raster, you may choose to perform a direct gridding of the data.
This means that you calculate one value for every cell in the raster where there
are sample points. This value may be a mean of all points, a max, min or some other
mathematical function. All other cells will then have no data values associated with
them. This means you may have gaps in your data if the point spacing is not well
distributed with at least one data point within the spatial coverage of each raster
cell.
When you directly grid a dataset, values will only be calculated
for cells that overlap with data points. Thus, data gaps will not be filled.
Source: National Ecological Observatory Network
We can create a raster from points through a process called gridding. Gridding is the process of taking a set of points and using them to create a surface composed of a regular grid.
Animation showing the general process of taking lidar point
clouds and converting them to a raster format. Source: Tristan Goulden,
National Ecological Observatory Network
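To make the idea concrete, here is a hedged sketch of direct gridding with the
terra package; the points below are fabricated for illustration and are not part
of the lesson downloads.
library(terra)
# fabricate lidar-like sample points with x/y coordinates and an elevation value z
pts_df  <- data.frame(x = runif(500, 0, 100),
                      y = runif(500, 0, 100),
                      z = rnorm(500, 350, 10))
pts_vec <- vect(pts_df, geom = c("x", "y"))
# template raster defining the grid; cells containing no points will remain NA
template <- rast(xmin = 0, xmax = 100, ymin = 0, ymax = 100, resolution = 5)
# one value per cell: here, the mean of all points falling inside that cell
gridded <- rasterize(pts_vec, template, field = "z", fun = mean)
plot(gridded, main = "Directly Gridded Points (mean per cell)")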
Spatial Interpolation
Spatial interpolation involves calculating the value for a query point (or
a raster cell) with an unknown value from a set of known sample point values that
are distributed across an area. There is a general assumption that points closer
to the query point are more strongly related to that cell than those farther away.
However this general assumption is applied differently across different
interpolation functions.
Interpolation methods will estimate values for cells where no known values exist.
Deterministic & Probabilistic Interpolators
There are two main types of interpolation approaches:
Deterministic: create surfaces directly from measured points using a
weighted distance or area function.
Probabilistic (Geostatistical): utilize the statistical properties of the
measured points. Probabilistic techniques quantify the spatial auto-correlation
among measured points and account for the spatial configuration of the sample
points around the prediction location.
Different methods of interpolation are better for different datasets. This table
lays out the strengths of some of the more common interpolation methods.
We will focus on deterministic methods in this tutorial.
Deterministic Interpolation Methods
Let's look at a few different deterministic interpolation methods to understand
how different methods can affect an output raster.
Inverse Distance Weighted (IDW)
Inverse distance weighted (IDW) interpolation calculates the values of a query
point (a cell with an unknown value) using a linearly weighted combination of values
from nearby points.
IDW interpolation calculates the value of an unknown cell center value (a query point) using surrounding points with the assumption that closest points
will be the most similar to the value of interest. Source: QGIS
Key Attributes of IDW Interpolation
The raster is derived based upon an assumed linear relationship between the
location of interest and the distance to surrounding sample points. In other
words, sample points closest to the cell of interest are assumed to be more related
to its value than those farther away.
Bounded/exact estimation, hence it cannot interpolate beyond the min/max range
of data point values. Estimating only within the range of existing sample point
values can yield "flattened" peaks and valleys -- especially if the data didn't
capture those high and low points.
Interpolated points are average values
Good for point data that are equally distributed and dense. Assumes a consistent
trend or relationship between points and does not accommodate trends within the
data (e.g., east to west, elevational, etc.).
IDW interpolation looks at the linear distance between the unknown value and surrounding points. Source: J. Abrecht, CUNY
Power
The power value changes the "weighting" of the IDW interpolation by specifying
how strongly points further away from the query point impact the calculated value
for that point. Power values range from 0-3+ with the default setting generally
being 2. A larger power value produces a more localized result - values farther
away from the cell have less impact on its calculated value, and values closer to
the cell impact its value more. A smaller power value produces a more averaged
result where sample points farther away from the cell have a greater impact on
the cell's calculated value.
Lower power values: more averaged result, potential for a smoother surface.
As power decreases, the influence of distant sample points is larger. This
yields a smoother surface that is more averaged.
Greater power values: more localized result, potential for more peaks and
valleys around sample point locations. As power increases, the influence of
sample points falls off more rapidly with distance. The query cell values become
more localized and less averaged.
IDW Take Home Points
IDW is good for:
Data whose distribution is strongly (and linearly) correlated with
distance. For example, noise falls off very predictably with distance.
Providing explicit control over the influence of distance (compared to Spline
or Kriging).
IDW is not so good for:
Data whose distribution depends on more complex sets of variables
because it can account only for the effects of distance.
Other features:
You can create a smoother surface by decreasing the power, increasing the
number of sample points used or increasing the search (sample points) radius.
You can create a surface that more closely represents the peaks and dips of
your sample points by decreasing the number of sample points used, decreasing
the search radius or increasing the power.
You can increase IDW surface accuracy by adding breaklines to the
interpolation process that serve as barriers. Breaklines represent abrupt
changes in elevation, such as cliffs.
Spline
Spline interpolation fits a curved surface through the sample points of your
dataset. Imagine stretching a rubber sheet across your points and gluing it to
each sample point along the way -- what you get is a smooth stretched sheet with
bumps and valleys. Unlike IDW, spline can estimate values above and below the
min and max values of your sample points. Thus it is good for estimating high
and low values not already represented in your data.
Estimating values outside of the range of sample input data.
Creating a smooth continuous surface.
Spline is not so good for:
Points that are close together and have large value differences. Slope calculations can yield over and underestimation.
Data with large, sudden changes in values that need to be represented (e.g., fault lines, extreme vertical topographic changes, etc). NOTE: some tools like ArcGIS have introduced a spline with barriers function in recent years.
Natural Neighbor Interpolation
Natural neighbor interpolation finds the closest subset of data points to the
query point of interest. It then applies weights to those points to calculate an
average estimated value based upon their proportionate areas derived from their
corresponding
Voronoi polygons
(see figure below for definition). The natural neighbor interpolator adapts
locally to the input data using points surrounding the query point of interest.
Thus there is no radius, number of points or other settings needed when using
this approach.
This interpolation method works equally well on regular and irregularly spaced data.
A Voronoi diagram is created by taking pairs of points that are close together and drawing a line that is equidistant between them and perpendicular to the line connecting them. Source: Wikipedia
Natural neighbor interpolation uses the area of each Voronoi polygon associated
with the surrounding points to derive a weight that is then used to calculate an
estimated value for the query point of interest.
To calculate the weighted area around a query point, a secondary Voronoi diagram
is created using the immediately neighboring points and mapped on top of the
original Voronoi diagram created using the known sample points (image below).
A secondary Voronoi diagram is created using the immediately neighboring points and mapped on top of the original Voronoi diagram created using the
known sample points to create a weighted natural neighbor interpolated raster.
Image Source: ESRI
Natural Neighbor Interpolation is good for:
Data whose spatial distribution is variable (and also data that are regularly distributed).
Categorical data.
Providing a smoother output raster.
Natural Neighbor Interpolation is not as good for:
Data where the interpolation needs to be spatially constrained (to a particular number of points or a set distance).
Data where sample points further away from or beyond the immediate "neighbor points" need to be considered in the estimation.
Other features:
Good as a local interpolator
Interpolated values fall within the range of values of the sample data
Surface passes through input samples; not above or below
Supports breaklines
Triangulated Irregular Network (TIN)
The Triangulated Irregular Network (TIN) is a vector based surface where sample
points (nodes) are connected by a series of edges creating a triangulated surface.
The TIN format remains the most true to the point distribution, density and
spacing of a dataset. It also may yield the largest file size!
A TIN created from LiDAR data collected by the NEON AOP over
the NEON San Joaquin (SJER) field site.
For more on the TIN process see this information from ESRI:
Overview of TINs
Interpolation in R, GRASS GIS, or QGIS
These additional resources point to tools and information for gridding in R, GRASS GIS and QGIS.
R functions
The following packages and functions may be useful when creating grids in R; a minimal IDW sketch using gstat follows the list.
gstat::idw()
stats::loess()
akima::interp()
fields::Tps()
fields::splint()
spatial::surf.ls()
geospt::rbf()
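For example, here is a minimal IDW sketch with gstat and sp. The sample points (pts), grid (grd), and elev values are made up for illustration and are not part of any NEON download.
# A hypothetical IDW example using gstat and sp
library(gstat)
library(sp)
# make up some sample points with an elevation attribute
set.seed(42)
pts <- data.frame(x = runif(50, 0, 100),
                  y = runif(50, 0, 100),
                  elev = rnorm(50, mean = 300, sd = 20))
coordinates(pts) <- ~x+y
# build a regular grid of query cells to interpolate onto
grd <- expand.grid(x = seq(0, 100, by = 2), y = seq(0, 100, by = 2))
coordinates(grd) <- ~x+y
gridded(grd) <- TRUE
# inverse distance weighted interpolation; idp is the power parameter
idw_surface <- gstat::idw(elev ~ 1, pts, newdata = grd, idp = 2)
# plot the predicted surface
spplot(idw_surface["var1.pred"], main = "IDW interpolated elevation (power = 2)")
Increasing idp makes the surface more closely honor individual sample points; decreasing it produces a smoother, more averaged surface.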
QGIS tools
Check out the documentation on different interpolation plugins
Interpolation
These hyperspectral remote sensing data provide information on the National Ecological Observatory Network's San Joaquin Experimental Range (SJER) field site in March of 2021. The data used in this lesson are the 1 km by 1 km mosaic tile named NEON_D17_SJER_DP3_257000_4112000_reflectance.h5. If you already completed the previous lesson in this tutorial series, you do not need to download these data again. The entire SJER reflectance dataset can be accessed from the NEON Data Portal.
Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.
For this tutorial, you should be comfortable reading data from a HDF5 file and have a general familiarity with hyperspectral data. If you aren't familiar with these steps already, we highly recommend you work through the Introduction to Working with Hyperspectral Data in HDF5 Format in R tutorial before moving on to this tutorial.
Everything on our planet reflects electromagnetic radiation from the Sun, and different types of land cover often have dramatically different reflectance properties across the spectrum. One of the most powerful aspects of the NEON Imaging Spectrometer (NIS, or hyperspectral sensor) is that it can accurately measure these reflectance properties at a very high spectral resolution. When you plot the reflectance values across the observed spectrum, you will see
that different land cover types (vegetation, pavement, bare soils, etc.) have distinct patterns in their reflectance values, a property that we call the
'spectral signature' of a particular land cover class.
In this tutorial, we will extract the reflectance values for all bands of a single pixel to plot a spectral signature for that pixel. In order to do this,
we need to pair the reflectance values for that pixel with the wavelength values of the bands that are represented in those measurements. We will also need to adjust the reflectance values by the scaling factor that is saved as an 'attribute' in the HDF5 file. First, let's start by defining the working
directory and reading in the example dataset.
# Call required packages
library(rhdf5)
library(plyr)
library(ggplot2)
library(neonUtilities)
wd <- "~/data/" #This will depend on your local environment
setwd(wd)
If you haven't already downloaded the hyperspectral data tile (in one of the previous tutorials in this series), you can use the neonUtilities function byTileAOP to download a single reflectance tile. You can run help(byTileAOP) to see more details on what the various inputs are. For this exercise, we'll specify the UTM Easting and Northing to be (257500, 4112500), which will download the tile with the lower left corner (257000, 4112000).
byTileAOP(dpID = 'DP3.30006.001',
site = 'SJER',
year = '2021',
easting = 257500,
northing = 4112500,
savepath = wd)
This file will be downloaded into a nested subdirectory under the ~/data folder (your working directory), inside a folder named DP3.30006.001 (the Data Product ID). The file should show up in this location: ~/data/DP3.30006.001/neon-aop-products/2021/FullSite/D17/2021_SJER_5/L3/Spectrometer/Reflectance/NEON_D17_SJER_DP3_257000_4112000_reflectance.h5.
Now we can read in the file and look at the contents using h5ls. You can move this file to a different location, but make sure to change the path accordingly.
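A minimal sketch of that step is below, assuming the tile stayed in the default download location described above; h5_file is the object used in the code that follows, so adjust the path if you moved the file.
# build the path to the downloaded reflectance tile (default byTileAOP location)
h5_file <- paste0(wd, "DP3.30006.001/neon-aop-products/2021/FullSite/D17/2021_SJER_5/L3/Spectrometer/Reflectance/NEON_D17_SJER_DP3_257000_4112000_reflectance.h5")
# list the groups and datasets contained in the HDF5 file
h5ls(h5_file)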
Next, let's read in the wavelength center associated with each band in the HDF5 file. We will later match these with the reflectance values and show both in our final spectral signature plot.
# read in the wavelength information from the HDF5 file
wavelengths <- h5read(h5_file,"/SJER/Reflectance/Metadata/Spectral_Data/Wavelength")
Extract Z-dimension data slice
Next, we will extract all reflectance values for one pixel. This makes up the spectral signature or profile of the pixel. To do that, we'll use the h5read() function. Here we pick an arbitrary pixel at (100,35), and use the NULL value to select all bands from that location.
# extract all bands from a single pixel
aPixel <- h5read(h5_file,"/SJER/Reflectance/Reflectance_Data",index=list(NULL,100,35))
# The line above generates a vector of reflectance values.
# Next, we reshape the data and turn them into a dataframe
b <- adply(aPixel,c(1))
# create clean data frame
aPixeldf <- b[2]
# add wavelength data to matrix
aPixeldf$Wavelength <- wavelengths
head(aPixeldf)
## V1 Wavelength
## 1 206 381.6035
## 2 266 386.6132
## 3 274 391.6229
## 4 297 396.6327
## 5 236 401.6424
## 6 236 406.6522
Scale Factor
Then, we can pull the spatial attributes that we'll need to adjust the reflectance values. Often, large raster data contain floating point (values with decimals) information. However, floating point data consume more space (yield a larger file size) compared to integer values. Thus, to keep the file sizes smaller, the data will be scaled by a factor of 10, 100, 10000, etc. This scale factor will be noted in the data attributes.
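A hedged sketch of that step follows. The attribute name Scale_Factor and the helper objects reflInfo and scaleFact are assumptions based on typical NEON reflectance HDF5 files; confirm them against the attributes of your own file.
# read the attributes attached to the reflectance dataset
# (the Scale_Factor attribute name is an assumption; check your file's attributes)
reflInfo <- h5readAttributes(h5_file, "/SJER/Reflectance/Reflectance_Data")
reflInfo
# divide the raw integer values by the scale factor to recover reflectance
scaleFact <- reflInfo$Scale_Factor
aPixeldf$Reflectance <- aPixeldf$V1 / as.vector(scaleFact)
head(aPixeldf)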
After completing this tutorial, you will be able to:
Filter data, alone and combined with simple pattern matching using grepl().
Use the group_by function in dplyr.
Use the summarise function in dplyr.
"Pipe" functions using dplyr syntax.
Things You’ll Need To Complete This Tutorial
You will need the most current version of R and, preferably, RStudio loaded
on your computer to complete this tutorial.
Install R Packages
neonUtilities: install.packages("neonUtilities") tools for working with
NEON data
dplyr: install.packages("dplyr") used for data manipulation
Intro to dplyr
When working with data frames in R, it is often useful to manipulate and
summarize data. The dplyr package in R offers one of the most comprehensive
group of functions to perform common manipulation tasks. In addition, the
dplyr functions are often of a simpler syntax than most other data
manipulation functions in R.
Elements of dplyr
There are several elements of dplyr that are unique to the library, and that
do very cool things!
dplyr aims to provide a function for each basic verb of data manipulation, like those below (a short sketch using a toy data frame follows this list):
filter() (and slice())
filter rows based on values in specified columns
arrange()
sort data by values in specified columns
select() (and rename())
view and work with data from only specified columns
distinct()
view and work with only unique values from specified columns
mutate() (and transmute())
add new data to the data frame
summarise()
calculate specified summary statistics on data
sample_n() and sample_frac()
return a random sample of rows
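To make these verbs concrete, here is a minimal sketch using a small made-up data frame; the object toy and its columns are purely illustrative and are not part of the NEON data used later in this tutorial.
library(dplyr)
# a tiny made-up data frame to illustrate the verbs
toy <- data.frame(species = c("A", "B", "A", "C"),
                  weight = c(10, 25, 12, 30))
# filter(): keep only rows where weight is greater than 11
filter(toy, weight > 11)
# arrange(): sort rows by weight, largest first
arrange(toy, desc(weight))
# select(): keep only the species column
select(toy, species)
# mutate(): add a new column derived from an existing one
mutate(toy, weight_doubled = weight * 2)
# summarise(): collapse the data frame to a single summary value
summarise(toy, mean_weight = mean(weight))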
Format of function calls
The single table verb functions share these features:
The first argument is a data.frame (or a dplyr special class tbl_df, known
as a 'tibble').
dplyr can work with data.frames as is, but if you're dealing with large
data it's worthwhile to convert them to a tibble, to avoid printing
a lot of data to the screen. You can do this with the function as_tibble().
Calling the class function on a tibble will return the vector
c("tbl_df", "tbl", "data.frame").
The subsequent arguments describe how to manipulate the data (e.g., based on
which columns, using which summary statistics), and you can refer to columns
directly (without using $).
The result is always a new tibble.
Function calls do not generate 'side-effects'; you always have to assign the
results to an object
Grouped operations
Certain functions (e.g., group_by, summarise, and other 'aggregate functions')
allow you to get information for groups of data, in one fell swoop. This is like
performing database operations without knowing SQL or any other database-specific code.
Powerful stuff!
Piping
We often need to get a subset of data using one function, and then use
another function to do something with that subset (and we may do this multiple
times). This leads to nesting functions, which can get messy and hard to keep
track of. Enter 'piping', dplyr's way of feeding the output of one function into
another, and so on, without the hassle of parentheses and brackets.
Let's say we want to start with the data frame my_data, apply function1(),
then function2(), and then function3(). This is what that looks like without
piping:
function3(function2(function1(my_data)))
This is messy, difficult to read, and the reverse of the order our functions
actually occur. If any of these functions needed additional arguments, the
readability would be even worse!
The piping operator %>% takes everything in front of it and "pipes" it into
the first argument of the function after. So now our example looks like this:
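my_data %>%
  function1() %>%
  function2() %>%
  function3()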
This runs identically to the original nested version!
For example, if we want to find the mean body weight of male mice, we'd do this:
myMammalData %>% # start with a data frame
filter(sex=='M') %>% # first filter for rows where sex is male
summarise (mean_weight = mean(weight)) # find the mean of the weight
# column, store as mean_weight
This is read as "from data frame myMammalData, select only males and return
the mean weight in a new column called mean_weight".
Download Small Mammal Data
Before we get started, we will need to download our data to investigate. To
do so, we will use the loadByProduct() function from the neonUtilities
package to download data straight from the NEON servers. To learn more about
this function, please see the Download and Explore NEON data tutorial here.
Let's look at the NEON small mammal capture data from Harvard Forest (within
Domain 01) for all of 2014. This site is located in the heart of the Lyme
disease epidemic.
# load packages
library(dplyr)
library(neonUtilities)
# load the NEON small mammal capture data
# NOTE: the check.size = TRUE argument means the function
# will require confirmation from you that you want to load
# the quantity of data requested
loadData <- loadByProduct(dpID="DP1.10072.001", site = "HARV",
startdate = "2014-01", enddate = "2014-12",
check.size = TRUE) # Console requires your response!
# if you'd like, check out the data
str(loadData)
The loadByProduct() function calls the NEON server, downloads the monthly
data files, and 'stacks' them together to form two tables of data called
'mam_pertrapnight' and 'mam_perplotnight'. It also saves the readme file,
explanations of variables, and validation metadata, and combines these all into
a single 'list' that we called 'loadData'. The only part of this list that we
really need for this tutorial is the 'mam_pertrapnight' table, so let's extract
just that one and call it 'myData'.
myData <- loadData$mam_pertrapnight
class(myData) # Confirm that 'myData' is a data.frame
## [1] "data.frame"
Use dplyr
For the rest of this tutorial, we are only going to be working with three
variables within 'myData':
scientificName a string of "Genus species"
sex a string with "F", "M", or "U"
identificationQualifier a string noting uncertainty in the species
identification
filter()
This function:
extracts only a subset of rows from a data frame according to specified
conditions
is similar to the base function subset(), but with simpler syntax
inputs: data object, any number of conditional statements on the named columns
of the data object
output: a data object of the same class as the input object (e.g.,
data.frame in, data.frame out) with only those rows that meet the conditions
For example, let's create a new dataframe that contains only female Peromyscus
maniculatus, one of the key small mammal players in the life cycle of the Lyme
disease-causing bacterium.
# filter on `scientificName` is Peromyscus maniculatus and `sex` is female.
# two equals signs (==) signifies "is"
data_PeroManicFemales <- filter(myData,
scientificName == 'Peromyscus maniculatus',
sex == 'F')
# Note how we were able to put multiple conditions into the filter statement,
# pretty cool!
So we have a dataframe with our female P. maniculatus, but how many are there?
# how many female P. maniculatus are in the dataset
# we could simply count the number of rows in the new dataset
nrow(data_PeroManicFemales)
## [1] 98
# or we could write it as a sentence
print(paste('In 2014, NEON technicians captured',
nrow(data_PeroManicFemales),
'female Peromyscus maniculatus at Harvard Forest.',
sep = ' '))
## [1] "In 2014, NEON technicians captured 98 female Peromyscus maniculatus at Harvard Forest."
That's awesome that we can quickly and easily count the number of female
Peromyscus maniculatus that were captured at Harvard Forest in 2014. However,
there is a slight problem. There are two Peromyscus species that are common
in Harvard Forest: Peromyscus maniculatus (deer mouse) and Peromyscus leucopus
(white-footed mouse). These species are difficult to distinguish in the field,
leading to some uncertainty in the identification, which is noted in the
'identificationQualifier' data field by the term "cf. species" or "aff. species".
(For more information on these terms and 'open nomenclature' please see this great Wiki article here).
When the field technician is certain of their identification (or if they forget
to note their uncertainty), identificationQualifier will be NA. Let's run the
same code as above, but this time specifying that we want only those observations
that are unambiguous identifications.
# filter on `scientificName` is Peromyscus maniculatus and `sex` is female.
# two equals signs (==) signifies "is"
data_PeroManicFemalesCertain <- filter(myData,
scientificName == 'Peromyscus maniculatus',
sex == 'F',
identificationQualifier =="NA")
# Count the number of un-ambiguous identifications
nrow(data_PeroManicFemalesCertain)
## [1] 0
Woah! So every single observation of a Peromyscus maniculatus had some level
of uncertainty that the individual may actually be Peromyscus leucopus. This
is understandable given the difficulty of field identification for these species.
Given this challenge, it will be best for us to consider these mice at the genus
level. We can accomplish this by searching for only the genus name in the
'scientificName' field using the grepl() function.
grepl()
This is a function in the base package (i.e., it isn't part of dplyr) that is
part of the suite of Regular Expressions functions. grepl uses regular
expressions to match patterns in character strings. Regular expressions offer
very powerful and useful tricks for data manipulation. They can be complicated
and therefore are a challenge to learn, but well worth it!
Here, we present a very simple example.
inputs: pattern to match, character vector to search for a match
output: a logical vector indicating whether the pattern was found within
each element of the input character vector
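As a quick, self-contained illustration (the character vector here is made up, not the NEON data):
# grepl() returns TRUE wherever the pattern appears in the string
grepl('Peromyscus', c('Peromyscus leucopus', 'Myodes gapperi', 'Peromyscus maniculatus'))
## [1]  TRUE FALSE  TRUE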
Let's use grepl to learn more about our possible disease vectors. In reality,
all species of Peromyscus are viable players in Lyme disease transmission, and
they are difficult to distinguish in a field setting, so we really should be
looking at all species of Peromyscus. Since we don't have genera split out as
a separate field, we have to search within the scientificName string for the
genus -- this is a simple example of pattern matching.
We can use the dplyr function filter() in combination with the base function
grepl() to accomplish this.
# combine filter & grepl to get all Peromyscus (a part of the
# scientificName string)
data_PeroFemales <- filter(myData,
grepl('Peromyscus', scientificName),
sex == 'F')
# how many female Peromyscus are in the dataset
print(paste('In 2014, NEON technicians captured',
nrow(data_PeroFemales),
'female Peromyscus spp. at Harvard Forest.',
sep = ' '))
## [1] "In 2014, NEON technicians captured 558 female Peromyscus spp. at Harvard Forest."
group_by() + summarise()
An alternative to using the filter function to subset the data (and make a new
data object in the process), is to calculate summary statistics based on some
grouping factor. We'll use group_by(), which does basically the same thing as
SQL or other tools for interacting with relational databases. For those
unfamiliar with SQL, no worries - dplyr provides lots of additional
functionality for working with databases (local and remote) that does not
require knowledge of SQL. How handy!
The group_by() function in dplyr allows you to perform functions on a subset
of a dataset without having to create multiple new objects or construct for()
loops. The combination of group_by() and summarise() are great for
generating simple summaries (counts, sums) of grouped data.
NOTE: Be conscientious about using summarise with other summary functions! You
need to think about weighting for means and variances, and summarise doesn't
work precisely for medians if there are any missing data (e.g., if there was no
value recorded, maybe it was for a good reason!).
Continuing with our small mammal data, since the diversity of the entire small
mammal community has been shown to impact disease dynamics among the key
reservoir species, we really want to know more about the demographics of the
whole community. We can quickly generate counts by species and sex in 2 lines of
code, using group_by and summarise.
# how many of each species & sex were there?
# step 1: group by species & sex
dataBySpSex <- group_by(myData, scientificName, sex)
# step 2: summarize the number of individuals of each using the new df
countsBySpSex <- summarise(dataBySpSex, n_individuals = n())
## `summarise()` regrouping output by 'scientificName' (override with `.groups` argument)
# view the data (just top 10 rows)
head(countsBySpSex, 10)
## # A tibble: 10 x 3
## # Groups: scientificName [5]
## scientificName sex n_individuals
## <chr> <chr> <int>
## 1 Blarina brevicauda F 50
## 2 Blarina brevicauda M 8
## 3 Blarina brevicauda U 22
## 4 Blarina brevicauda <NA> 19
## 5 Glaucomys volans M 1
## 6 Mammalia sp. U 1
## 7 Mammalia sp. <NA> 1
## 8 Microtus pennsylvanicus F 2
## 9 Myodes gapperi F 103
## 10 Myodes gapperi M 99
Note: the output of step 1 (dataBySpSex) does not look any different than the
original dataframe (myData), but the application of subsequent functions (e.g.,
summarise) to this new dataframe will produce distinctly different results than
if you tried the same operations on the original. Try it if you want to see the
difference! This is because the group_by() function converted dataBySpSex
into a "grouped_df" rather than just a "data.frame". In order to confirm this,
try using the class() function on both myData and dataBySpSex. You can
also read the help documentation for this function by running the code:
?group_by()
# View class of 'myData' object
class(myData)
## [1] "data.frame"
# View class of 'dataBySpSex' object
class(dataBySpSex)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
# View help file for group_by() function
?group_by()
Pipe functions together
We created multiple new data objects during our explorations of dplyr
functions, above. While this works, we can produce the same results more
efficiently by chaining functions together and creating only one new data object
that encapsulates all of the previously sought information: filter on only
females, grepl to get only Peromyscus spp., group_by individual species, and
summarise the numbers of individuals.
# combine several functions to get a summary of the numbers of individuals of
# female Peromyscus species in our dataset.
# remember %>% are "pipes" that allow us to pass information from one function
# to the next.
dataBySpFem <- myData %>%
filter(grepl('Peromyscus', scientificName), sex == "F") %>%
group_by(scientificName) %>%
summarise(n_individuals = n())
## `summarise()` ungrouping output (override with `.groups` argument)
# view the data
dataBySpFem
## # A tibble: 3 x 2
## scientificName n_individuals
## <chr> <int>
## 1 Peromyscus leucopus 455
## 2 Peromyscus maniculatus 98
## 3 Peromyscus sp. 5
Cool!
Base R only
So that is nice, but we had to install a new package dplyr. You might ask,
"Is it really worth it to learn new commands if I can do this in base R?" While
we think "yes," why don't you see for yourself? Here is the base R code needed
to accomplish the same task.
# For reference, the same output but using R's base functions
# First, subset the data to only female Peromyscus
dataFemPero <- myData[myData$sex == 'F' &
grepl('Peromyscus', myData$scientificName), ]
# Option 1 --------------------------------
# Use aggregate and then rename columns
dataBySpFem_agg <-aggregate(dataFemPero$sex ~ dataFemPero$scientificName,
data = dataFemPero, FUN = length)
names(dataBySpFem_agg) <- c('scientificName', 'n_individuals')
# view output
dataBySpFem_agg
## scientificName n_individuals
## 1 Peromyscus leucopus 455
## 2 Peromyscus maniculatus 98
## 3 Peromyscus sp. 5
# Option 2 --------------------------------------------------------
# Do it by hand
# Get the unique scientificNames in the subset
sppInDF <- unique(dataFemPero$scientificName[!is.na(dataFemPero$scientificName)])
# Use a loop to calculate the numbers of individuals of each species
sciName <- vector(); numInd <- vector()
for (i in sppInDF) {
sciName <- c(sciName, i)
numInd <- c(numInd, length(which(dataFemPero$scientificName==i)))
}
#Create the desired output data frame
dataBySpFem_byHand <- data.frame('scientificName'=sciName,
'n_individuals'=numInd)
# view output
dataBySpFem_byHand
## scientificName n_individuals
## 1 Peromyscus leucopus 455
## 2 Peromyscus maniculatus 98
## 3 Peromyscus sp. 5
R Script & Challenge Code: NEON data lessons often contain challenges that reinforce
learned skills. If available, the code for challenge solutions is found in the
downloadable R script of the entire lesson, available in the footer of each lesson page.
Additional Resources
Consider reviewing the documentation for the
RHDF5 package.
About HDF5
The HDF5 file can store large, heterogeneous datasets that include metadata. It
also supports efficient data slicing, or extraction of particular subsets of a
dataset, which means that you don't have to read large files into the computer's
memory/RAM in their entirety in order to work with them.
To access HDF5 files in R, we will use the rhdf5 library which is part of
the Bioconductor
suite of R libraries. It might also be useful to install
the
free HDF5 viewer
which will allow you to explore the contents of an HDF5 file using a graphic interface.
First, let's get R set up. As noted above, the rhdf5 library is part of the
Bioconductor suite of R packages; as of May 2020 this package was not yet on CRAN.
# Install rhdf5 package (only need to run if not already installed)
#install.packages("BiocManager")
#BiocManager::install("rhdf5")
# Call the R HDF5 Library
library("rhdf5")
# set working directory to ensure R can find the file we wish to import and where
# we want to save our files
wd <- "~/Git/data/" #This will depend on your local environment
setwd(wd)
Now, let's create a basic H5 file with one group and one dataset in it.
# Create hdf5 file
h5createFile("vegData.h5")
## [1] TRUE
# create a group called aNEONSite within the H5 file
h5createGroup("vegData.h5", "aNEONSite")
## [1] TRUE
# view the structure of the h5 we've created
h5ls("vegData.h5")
## group name otype dclass dim
## 0 / aNEONSite H5I_GROUP
Next, let's create some dummy data to add to our H5 file.
# create some sample, numeric data
a <- rnorm(n=40, mean=1, sd=1)
someData <- matrix(a,nrow=20,ncol=2)
Write the sample data to the H5 file.
# add some sample data to the H5 file located in the aNEONSite group
# we'll call the dataset "temperature"
h5write(someData, file = "vegData.h5", name="aNEONSite/temperature")
# let's check out the H5 structure again
h5ls("vegData.h5")
## group name otype dclass dim
## 0 / aNEONSite H5I_GROUP
## 1 /aNEONSite temperature H5I_DATASET FLOAT 20 x 2
View a "dump" of the entire HDF5 file. NOTE: use this command with CAUTION if you
are working with larger datasets!
# we can look at everything too
# but be cautious using this command!
h5dump("vegData.h5")
## $aNEONSite
## $aNEONSite$temperature
## [,1] [,2]
## [1,] 0.33155432 2.4054446
## [2,] 1.14305151 1.3329978
## [3,] -0.57253964 0.5915846
## [4,] 2.82950139 0.4669748
## [5,] 0.47549005 1.5871517
## [6,] -0.04144519 1.9470377
## [7,] 0.63300177 1.9532294
## [8,] -0.08666231 0.6942748
## [9,] -0.90739256 3.7809783
## [10,] 1.84223101 1.3364965
## [11,] 2.04727590 1.8736805
## [12,] 0.33825921 3.4941913
## [13,] 1.80738042 0.5766373
## [14,] 1.26130759 2.2307994
## [15,] 0.52882731 1.6021497
## [16,] 1.59861449 0.8514808
## [17,] 1.42037674 1.0989390
## [18,] -0.65366487 2.5783750
## [19,] 1.74865593 1.6069304
## [20,] -0.38986048 -1.9471878
# Close the file. This is good practice.
H5close()
Add Metadata (attributes)
Let's add some metadata (called attributes in HDF5 land) to our dummy temperature
data. First, open up the file.
# open the file, create a class
fid <- H5Fopen("vegData.h5")
# open up the dataset to add attributes to, as a class
did <- H5Dopen(fid, "aNEONSite/temperature")
# Provide the NAME and the ATTR (what the attribute says) for the attribute.
h5writeAttribute(did, attr="Here is a description of the data",
name="Description")
h5writeAttribute(did, attr="Meters",
name="Units")
Now we can add some attributes to the group as well.
# let's add some attributes to the group
did2 <- H5Gopen(fid, "aNEONSite/")
h5writeAttribute(did2, attr="San Joaquin Experimental Range",
name="SiteName")
h5writeAttribute(did2, attr="Southern California",
name="Location")
# close the files, groups and the dataset when you're done writing to them!
H5Dclose(did)
H5Gclose(did2)
H5Fclose(fid)
Working with an HDF5 File in R
Now that we've created our H5 file, let's use it! First, let's have a look at
the attributes of the dataset and group in the file.
# look at the attributes of the temperature dataset
h5readAttributes(file = "vegData.h5",
name = "aNEONSite/temperature")
## $Description
## [1] "Here is a description of the data"
##
## $Units
## [1] "Meters"
# look at the attributes of the aNEONsite group
h5readAttributes(file = "vegData.h5",
name = "aNEONSite")
## $Location
## [1] "Southern California"
##
## $SiteName
## [1] "San Joaquin Experimental Range"
# let's grab some data from the H5 file
testSubset <- h5read(file = "vegData.h5",
name = "aNEONSite/temperature")
testSubset2 <- h5read(file = "vegData.h5",
name = "aNEONSite/temperature",
index=list(NULL,1))
H5close()
Once we've extracted data from our H5 file, we can work with it
in R.
# create a quick plot of the data
hist(testSubset2)
Challenge: Work with HDF5 Files
Time to practice the skills you've learned. Open up the D17_2013_SJER_vegStr.csv
in R.
Create a new HDF5 file called vegStructure.
Add a group in your HDF5 file called SJER.
Add the veg structure data to that folder.
Add some attributes to the SJER group and to the data.
Now, repeat the above with the D17_2013_SOAP_vegStr csv.
Name your second group SOAP
Hint: read.csv() is a good way to read in .csv files.
After completing this tutorial, you will be able to:
Publish & share an interactive plot of the data using Plotly.
Subset data by date (if completing Additional Resources code).
Set a NoData Value to NA in R (if completing Additional Resources code).
Things You'll Need To Complete This Lesson
Please be sure you have the most current version of R and, preferably,
RStudio to write your code.
R Libraries to Install:
ggplot2: install.packages("ggplot2")
plotly: install.packages("plotly")
Data to Download
Part of this lesson is to access and download the data directly from NOAA's National
Climate Divisional Database. If instead you would prefer to download the data as a
single compressed file, it can be downloaded from the NEON Data Skills account
on FigShare.
Set Working Directory This lesson assumes that you have set your working
directory to the location of the downloaded and unzipped data subsets.
R Script & Challenge Code: NEON data lessons often contain challenges that
reinforce learned skills. If available, the code for challenge solutions is found
in the downloadable R script of the entire lesson, available in the footer of each lesson page.
Research Question
What was the pattern of drought leading up to the 2013 flooding events in Colorado?
Drought Data - the Palmer Drought Index
The Palmer Drought Severity Index
is an overall index of drought that is useful to understand drought associated
with agriculture. It uses temperature, precipitation and soil moisture data
to calculate water supply and demand. The values of the Palmer Drought Severity Index range from extreme drought
(values < -4.0) through near normal (-0.49 to 0.49) to extremely moist (> 4.0).
Access the Data
This section of the tutorial describes how to access and download the data
directly from NOAA's National Climate Divisional Database. You can also
download the pre-compiled data as a compressed file following the directions
in the Data Download section at the top of this lesson. Even if you choose to
use the pre-compiled data, you should still go through this section to learn
about the data we are using and the metadata that accompanies it.
The data used in this tutorial comes from
NOAA's Climate Divisional Database (nCLIMDIV).
This dataset contains temperature, precipitation, and drought indication data
from across the United States. Data can be accessed over national, state, or
divisional extents.
Explore the
nCLIMDIV portal
to learn more about the data they provide and how to retrieve it.
The nCLIMDIV page shows several boxes where we can enter search criteria to find
the particular datasets we need. First, though, we should figure out:
what data are available,
what format the data are available in,
what spatial and temporal extent for the data can we access,
the meaning of any abbreviations & technical terms used in the data files, and
any other information that we'd need to know before deciding which datasets to work with.
What data are available?
Below the search boxes, the nCLIMDIV page shows a table (reproduced below) with the different
indices that are included in any downloaded dataset. Here we see that the
Palmer Drought Severity Index (PDSI) is one of many indices available.
Indices
PCP: Precipitation Index
TAVG: Temperature Index
TMIN: Minimum Temperature Index
TMAX: Maximum Temperature Index
PDSI: Palmer Drought Severity Index
PHDI: Palmer Hydrological Drought Index
ZNDX: Palmer Z-Index
PMDI: Modified Palmer Drought Severity Index
CDD: Cooling Degree Days
HDD: Heating Degree Days
SPnn: Standard Precipitation Index
Sample Data
Below the table is a link to the Comma Delimited Sample where you can see
a sample of what the data look like. Using the table above we can identify
most of the headers. YearMonth is also pretty self-explanatory - it is the year
then the month digit (YYYYMM) with no space. However, what do StateCode
and Division mean? We need to know more about these abbreviations before we can use this dataset.
Metadata File
Below the table is another link to the Divisional Data Description. Click on
this link and you will be taken to a page with the metadata for the nCLIMDIV data (this file
is included in the Download Data .zip file -- divisional-readme.txt).
Skim through this metadata file.
Can you find out what the StateCode is?
What other information is important or interesting to you?
We are interested in the Palmer Drought Severity Index. What information
does the metadata give you about this Index?
Download the Data
Now that we know a bit more about the contents of the dataset, we can access and download
the particular data we need to explore the drought leading up to the 2013 flooding
events in Colorado.
Let's look at the 25-year period from 1991-2015. We will enter the following
information on the State tab to get our desired dataset:
Select Nation: US
Select State: Colorado
Start Period: January (01) 1991
End Period: December (12) 2015
Text Output: Comma Delimited
These search criteria result in a text file (.txt) that you can open in
your browser or download and open with a text editor.
Save the Data
To save this data file to your computer, right click on the link and select Save link as.
Each download from this site is given a unique code, therefore your file
will have a slightly different name from this example. To help remember where the data
are from, add the initials _CO to the end of the file name (but before the
file extension) so that it looks like this: CDODiv8506877122044_CO.txt (remember
that the code name for your file will be different).
Save the file to the ~/Documents/data/disturb-events-co13/drought directory
that you created in the set up for this tutorial.
Load the Data in R
Now that we have the data we can start working with it in R.
R Libraries
We will use ggplot2 to efficiently plot our data and plotly to create
interactive plots of our data.
We are interested in looking at the severity (or lack thereof) of drought in
Colorado before the 2013 floods. The first step is to import the data we just downloaded into R.
Our data have a header (which we saw in the sample file). This first row represents the
variable name for each column. We will use header=TRUE when we import the
data to tell R to use that first row as a list of column names rather than a row of data.
# Load the packages we will use in this lesson
library(ggplot2) # efficient plotting
library(plotly)  # interactive plots
## Set your working directory to ensure R can find the file we wish to import and where we want to save our files. Be sure to move the downloaded files into your working directory!
wd <- "~/Git/data/" # This will depend on your local environment
setwd(wd)
# Import CO state-wide nCLIMDIV data
nCLIMDIV <- read.csv(paste0(wd,"disturb-events-co13/drought/CDODiv8506877122044_CO.txt"), header = TRUE)
# view data structure
str(nCLIMDIV)
## 'data.frame': 300 obs. of 21 variables:
## $ StateCode: int 5 5 5 5 5 5 5 5 5 5 ...
## $ Division : int 0 0 0 0 0 0 0 0 0 0 ...
## $ YearMonth: int 199101 199102 199103 199104 199105 199106 199107 199108 199109 199110 ...
## $ PCP : num 0.8 0.44 1.98 1.27 1.63 1.88 2.69 2.44 1.36 1.06 ...
## $ TAVG : num 21.9 32.5 34.9 41.9 53.5 62.5 66.5 65.5 57.5 47.4 ...
## $ PDSI : num -1.37 -1.95 -1.77 -1.89 -2.11 0.11 0.6 1.03 0.95 0.67 ...
## $ PHDI : num -1.37 -1.95 -1.77 -1.89 -2.11 -1.79 -1.11 1.03 0.95 0.67 ...
## $ ZNDX : num -0.9 -2.17 -0.07 -0.92 -1.25 0.33 1.49 1.5 0.07 -0.54 ...
## $ PMDI : num -0.4 -1.48 -1.28 -1.63 -2.11 -1.57 -0.15 1.03 0.89 0.09 ...
## $ CDD : int 0 0 0 0 3 62 95 73 12 0 ...
## $ HDD : int 1296 868 900 678 343 113 30 45 227 555 ...
## $ SP01 : num -0.4 -1.78 0.89 -0.56 -0.35 0.65 0.96 0.7 -0.01 -0.26 ...
## $ SP02 : num -0.47 -1.42 -0.11 0.09 -0.67 0.15 1.01 1.07 0.42 -0.26 ...
## $ SP03 : num 0.05 -1.28 -0.36 -0.56 -0.19 -0.28 0.55 1.11 0.78 0.13 ...
## $ SP06 : num 0.42 0.15 0.03 -0.47 -0.86 -0.48 -0.03 0.51 0.24 0.35 ...
## $ SP09 : num 0.41 0.11 0.85 -0.07 -0.07 -0.19 -0.02 0 0.03 0.01 ...
## $ SP12 : num 0.69 0.41 0.43 0.08 -0.06 0.39 0.19 0.49 0.21 0.03 ...
## $ SP24 : num -0.41 -0.72 -0.49 -0.37 -0.26 -0.24 -0.06 0.12 0.05 0.11 ...
## $ TMIN : num 9.5 17.7 22.3 28.4 39.3 48.1 52 51.6 43.1 31.9 ...
## $ TMAX : num 34.3 47.4 47.5 55.3 67.7 76.9 81.1 79.5 72 62.9 ...
## $ X : logi NA NA NA NA NA NA ...
Using head() or str() allows us to view just a sampling of our data.
One of the first things we always check is whether R interpreted the data in
the format we want.
The Palmer Drought Severity Index (PDSI) is a number, so it was imported correctly!
Date Fields
Let's have a look at the date field in our data, which has the column name YearMonth.
Viewing the structure, it appears as if it is not in a standard date format.
It imported into R as an integer (int).
$ YearMonth: int 199101 199102 199103 199104 199105 ...
We want to convert these numbers into a date field. We might be able to use the
as.Date function to convert our string of numbers into a date that R will
recognize.
# convert to date, and create a new Date column
nCLIMDIV$Date <- as.Date(nCLIMDIV$YearMonth, format="%Y%m")
Oops, that doesn't work! R returned an origin error. The date class expects to
have day, month, and year data instead of just year and month. R needs a day
of the month in order to properly
convert this to a date class. The origin error is saying that it doesn't know where
to start counting the dates. (Note: If you have the zoo package installed you
will not get this error but Date will be filled with NAs.)
We can add a fake "day" to our YearMonth data using the paste0 function. Let's
add 01 to each field so R thinks that each date represents the first of the
month.
#add a day of the month to each year-month combination
nCLIMDIV$Date <- paste0(nCLIMDIV$YearMonth,"01")
#convert to date
nCLIMDIV$Date <- as.Date(nCLIMDIV$Date, format="%Y%m%d")
# check to see it works
str(nCLIMDIV$Date)
## Date[1:300], format: "1991-01-01" "1991-02-01" "1991-03-01" "1991-04-01" ...
We've now successfully converted our integer class YearMonth column into the
Date column in a date class.
Plot the Data
Next, let's plot the data using ggplot().
# plot the Palmer Drought Index (PDSI)
palmer.drought <- ggplot(data=nCLIMDIV,
aes(Date,PDSI)) + # x is Date & y is drought index
geom_bar(stat="identity",position = "identity") + # bar plot
xlab("Date") + ylab("Palmer Drought Severity Index") + # axis labels
ggtitle("Palmer Drought Severity Index - Colorado \n 1991-2015") # title on 2 lines (\n)
# view the plot
palmer.drought
Great - we've successfully created a plot!
Questions
Which values, positive or negative, correspond to years of drought?
Were the months leading up to the September 2013 floods a drought?
What overall patterns do you see in "wet" and "dry" years over these 25 years?
Is the average year in Colorado wet or dry?
Are there more wet years or dry years over this period of time?
These last two questions are a bit hard to determine from this plot. Let's look
at a quick summary of our data to help us out.
#view summary stats of the Palmer Drought Severity Index
summary(nCLIMDIV$PDSI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.090 -1.702 0.180 -0.310 1.705 5.020
#view histogram of the data
hist(nCLIMDIV$PDSI, # the date we want to use
main="Histogram of PDSI", # title
xlab="Palmer Drought Severity Index (PDSI)", # x-axis label
col="wheat3") # the color of the bars
Now we can see that the "median" year is slightly wet (0.180) but the
"mean" year is slightly dry (-0.310), although both are within the "near-normal"
range of -0.41 to 0.41. We can also see that there is a longer tail on the dry side
of the histogram than on the wet side.
Both of these figures only give us a static view of the data. There are
packages in R that allow us to create figures that can be published
to the web and allow us to explore the data in a more interactive way.
Plotly - Interactive (and Online) Plots
Plotly
allows you to create interactive plots that can be shared online. If
you are new to Plotly, view the companion mini-lesson
Interactive Data Vizualization with R and Plotly
to learn how to set up an account and access your username and API key.
Step 1: Connect your Plotly account to R
First, we need to connect our R session to our Plotly account. If you only want
to create interactive plots and not share them on a Plotly account, you can skip
this step.
# set plotly user name
Sys.setenv("plotly_username"="YOUR_plotly_username")
# set plotly API key
Sys.setenv("plotly_api_key"="YOUR_api_key")
Step 2: Create a Plotly plot
We can create an interactive version of our plot using Plotly. We should simply be able to use our existing ggplot object palmer.drought with the
ggplotly() function to create an interactive plot.
# Use existing ggplot plot & view as plotly plot in R
palmer.drought_ggplotly <- ggplotly(palmer.drought)
palmer.drought_ggplotly
That doesn't look right. The current plotly package has a bug! This
bug has been reported and a fix may come out in future updates to the package.
Until that happens, we can build our plot again using the plot_ly() function.
In the future, you could just skip the ggplot() step and plot directly with
plot_ly().
# use plotly function to create plot
palmer.drought_plotly <- plot_ly(nCLIMDIV, # the R object dataset
type= "bar", # the type of graph desired
x=nCLIMDIV$Date, # our x data
y=nCLIMDIV$PDSI, # our y data
orientation="v") %>% # for bars to orient vertically ("h" for horizontal)
layout(title="Palmer Drought Severity Index - Colorado 1991-2015") # add the title via layout(), since plot_ly() itself does not take a title argument
palmer.drought_plotly
Questions
Check out the differences between the ggplot() and the plot_ly() plot.
From both plots, answer these questions (Hint: Hover your cursor over the bars
of the plotly plot)
What is the minimum value?
In what month did the lowest PDSI occur?
In what month, and of what magnitude, did the highest PDSI occur?
What was the drought severity value in August 2013, the month before the
flooding?
Contrast the ggplot() and plot_ly() functions.
What are the biggest differences you see between the ggplot() and plot_ly() functions?
When might you want to use ggplot()?
When might plot_ly() be better?
Step 3: Push to Plotly online
Once the plot is built with a Plotly related function (plot_ly or ggplotly) you can post the plot to your online account. Make sure you are signed in to Plotly to complete this step.
# publish plotly plot to your plot.ly online account when you are happy with it
# skip this step if you haven't connected a Plotly account
api_create(palmer.drought_plotly)
Questions
Now that we can see the online Plotly user interface, we can explore our plots a bit more.
Each plot can have comments added below the plot to serve as a caption. What would be appropriate information to add for this plot?
Who might you want to share this plot with? What tools are there to share this plot?
Challenge: Does spatial scale affect the pattern?
In the steps above we have been looking at data aggregated across the entire state of Colorado. What if we look just at the watershed that includes the Boulder area where our investigation is centered?
If you zoom in on the
Divisional Map,
you can see that Boulder, CO is in the CO-04 Platte River Drainage.
Use the divisional data and the process you've just learned to create a plot of
the PDSI for Colorado Division 04 to compare to the plot for the state of
Colorado that you've already made.
If you are using the downloaded dataset accompanying this lesson, this data can be found at "drought/CDODiv8868227122048_COdiv04.txt".
Challenge: Do different measures show the same pattern?
The nCLIMDIV dataset includes not only the Palmer Drought Severity Index
but also several other measures of precipitation, drought, and temperature.
Choose one and repeat the steps above to see if a different but related
measure shows a similar pattern. Make sure to go back to the
metadata so that you understand what the index or measurement you choose represents.
Additional Resources
No Data Values
If you choose to explore other time frames or spatial scales you may
come across data that appear as if they have a negative value -99.99.
If this were real, it would be a very severe drought!
This value is just a common placeholder for a No Data Value.
Think about what happens if the instruments failed for a little while and stopped producing meaningful data. The instruments can't record 0 for this Index because 0 means "normal" levels. Using a blank entry isn't a good idea for several
reasons: blanks cause problems for software reading a file, they can mess up table structure, and you don't know if the data were missing (someone forgot to enter them)
or if no data were available. Therefore, a placeholder value is often used. This value changes between disciplines but -9999 or -99.99 are common.
In R, we need to assign this placeholder value to NA, which is R's
representation of a null or NoData value. If we don't, R will include the -99.99 value whenever calculations are performed
or plots are created. By defining the placeholders as NA, R will correctly interpret, and not plot, the bad values.
Using the nCLIMDIV dataset for the entire US, this is how we'd assign the placeholder
value to NA and plot the data.
# NoData Value in the nCLIMDIV data from 1990-2015 US spatial scale
nCLIMDIV_US <- read.csv(paste0(wd,"disturb-events-co13/drought/CDODiv5138116888828_US.txt"), header = TRUE)
# add a day of the month to each year-month combination
nCLIMDIV_US$Date <- paste0(nCLIMDIV_US$YearMonth,"01")
# convert to date
nCLIMDIV_US$Date <- as.Date(nCLIMDIV_US$Date, format="%Y%m%d")
# check to see it works
str(nCLIMDIV_US$Date)
## Date[1:312], format: "1990-01-01" "1990-02-01" "1990-03-01" "1990-04-01" ...
# view histogram of data -- great way to check the data range
hist(nCLIMDIV_US$PDSI,
main="Histogram of PDSI values",
col="springgreen4")
# easy to see the "wrong" values near 100
# check for these values using min() - what is the minimum value?
min(nCLIMDIV_US$PDSI)
## [1] -99.99
# assign NA to the rows where the PDSI value is -99.99
# Note: you may want to do this across the entire data.frame or with other columns
# but check out the metadata -- there are different NoData Values for each column!
nCLIMDIV_US[nCLIMDIV_US$PDSI==-99.99,] <- NA # == is shorthand for "is equal to"
#view the histogram again - does the range look reasonable?
hist(nCLIMDIV_US$PDSI,
main="Histogram of PDSI value with NA value assigned",
col="springgreen4")
# that looks better!
#plot Palmer Drought Index data
ggplot(data=nCLIMDIV_US,
aes(Date,PDSI)) +
geom_bar(stat="identity",position = "identity") +
xlab("Year") + ylab("Palmer Drought Severity Index") +
ggtitle("Palmer Drought Severity Index - Colorado\n1991 thru 2015")
## Warning: Removed 2 rows containing missing values (geom_bar).
# The warning message lets you know that two "entries" will be missing from the
# graph -- these are the ones we assigned NA.
Subsetting Data
After you have downloaded the data, you might decide that you want to plot only
a subset of the data range you downloaded -- say, just the decade 2005 to 2015
instead of the full record from 1990 to 2015. With the Plotly interactive plots you can zoom in on
that section, but even so you might want a plot with only a section of the data.
You could re-download the data with a new search, but that would create extra,
possibly confusing, data files! Instead, we can easily select a subset of the data within R. Once we
have a column of data defined as a Date class in R, we can quickly
subset the data by date and create a new R object using the subset() function.
To subset, we use the subset function, and specify:
the R object that we wish to subset,
the date column and start date of the subset, and
the date column and end date of the subset.
Let's subset the data.
# subset out data between 2005 and 2015
nCLIMDIV2005.2015 <- subset(nCLIMDIV, # our R object dataset
Date >= as.Date('2005-01-01') & # start date
Date <= as.Date('2015-12-31')) # end date
# check to make sure it worked
head(nCLIMDIV2005.2015$Date) # head() shows first 6 lines
## [1] "2005-01-01" "2005-02-01" "2005-03-01" "2005-04-01" "2005-05-01"
## [6] "2005-06-01"
tail(nCLIMDIV2005.2015$Date) # tail() shows last 6 lines
## [1] "2015-07-01" "2015-08-01" "2015-09-01" "2015-10-01" "2015-11-01"
## [6] "2015-12-01"
Now we can plot this decade of data. Hint, we can copy/paste and edit the
previous code.
# use plotly function to create plot
palmer_plotly0515 <- plot_ly(nCLIMDIV2005.2015, # the R object dataset
type= "bar", # the type of graph desired
x=nCLIMDIV2005.2015$Date, # our x data
y=nCLIMDIV2005.2015$PDSI, # our y data
orientation="v") %>% # for bars to orient vertically ("h" for horizontal)
layout(title="Palmer Drought Severity Index - Colorado 2005-2015") # add the title via layout()
palmer_plotly0515
# publish plotly plot to your plot.ly online account when you are happy with it
# skip this step if you haven't connected a Plotly account
api_create(palmer_plotly0515)