After completing this tutorial, you will be able to:
Explain what the Hierarchical Data Format (HDF5) is.
Describe the key benefits of the HDF5 format, particularly related to big data.
Describe both the types of data that can be stored in HDF5 and how it can be stored/structured.
About Hierarchical Data Formats - HDF5
The Hierarchical Data Format version 5 (HDF5) is an open source file format
that supports large, complex, heterogeneous data. HDF5 uses a "file directory"
like structure that allows you to organize data within the file in many different
structured ways, as you might do with files on your computer. The HDF5 format
also allows for embedding of metadata, making it self-describing.
**Data Tip:** HDF5 is one hierarchical data format
that builds upon both HDF4 and NetCDF (two other hierarchical data formats).
Read more about HDF5 here.
Hierarchical Structure - A file directory within a file
The HDF5 format can be thought of as a file system contained and described
within one single file. Think about the files and folders stored on your computer.
You might have a data directory with some temperature data for multiple field
sites. These temperature data are collected every minute and summarized on an
hourly, daily and weekly basis. Within one HDF5 file, you can store a similar
set of data organized in the same way that you might organize files and folders
on your computer. However, in an HDF5 file, what we call "directories" or "folders"
on our computers are called groups, and what we call files on our
computers are called datasets.
2 Important HDF5 Terms
Group: A folder like element within an HDF5 file that might contain other
groups OR datasets within it.
Dataset: The actual data contained within the HDF5 file. Datasets are often
(but don't have to be) stored within groups in the file.
An example HDF5 file structure which contains groups, datasets and associated metadata.
An HDF5 file containing datasets might be structured like this:
An example HDF5 file structure containing data for multiple field sites and also containing various datasets (averaged at different time intervals).
HDF5 is a Self-Describing Format
The HDF5 format is self-describing. This means that each file, group, and dataset
can have associated metadata that describes exactly what the data are. Following
the example above, we can embed information about each site into the file, such as:
The full name and X,Y location of the site
Description of the site.
Any documentation of interest.
Similarly, we might add information about how the data in the dataset were
collected, such as descriptions of the sensor used to collect the temperature
data. We can also attach information, to each dataset within the site group,
about how the averaging was performed and over what time period data are available.
One key benefit of having metadata attached to each file, group, and
dataset is that this facilitates automation without the need for a separate
(and additional) metadata document. Using a programming language like R or
Python, we can grab information from the metadata that are already associated
with the dataset and that we might need to process the dataset.
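For instance, with the rhdf5 R package (introduced later in this series), we can read just the attributes attached to a dataset. A minimal sketch, assuming a file and dataset like the sensorData.h5 example built later in this series, with a "units" attribute:
# read only the metadata (attributes) attached to a dataset
library(rhdf5)
meta <- h5readAttributes("sensorData.h5", "location1/precip")
meta$units  # e.g., "millimeters" -- no need to load the data itself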
HDF5 files are self describing - this means that all elements
(the file itself, groups and datasets) can have associated metadata that
describes the information contained within the element.
Compressed & Efficient subsetting
The HDF5 format is a compressed format. The size of all data contained within
HDF5 is optimized which makes the overall file size smaller. Even when
compressed, however, HDF5 files often contain big data and can thus still be
quite large. A powerful attribute of HDF5 is data slicing, by which
particular subsets of a dataset can be extracted for processing. This means that
the entire dataset doesn't have to be read into memory (RAM), which is very
helpful when working with very large (gigabytes or more) datasets!
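As a sketch of what slicing looks like in practice (again assuming the rhdf5 package and the sensorData.h5 example file built later in this series), we can read only the first ten rows of a dataset:
# read only rows 1-10 of a dataset; the rest never touches RAM
library(rhdf5)
precip_subset <- h5read("sensorData.h5", "location1/precip",
                        index = list(1:10, NULL))
dim(precip_subset)  # 10 x 3 for the example dataset built later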
Heterogeneous Data Storage
HDF5 files can store many different types of data within the same file. For
example, one group may contain a set of datasets that hold integer (numeric)
and text (string) data. Or, one dataset can contain heterogeneous data types
(e.g., both text and numeric data in one dataset). This means that HDF5 can store
any of the following (and more) in one file:
Temperature, precipitation and PAR (photosynthetically active radiation) data for
a site or for many sites
A set of images that cover one or more areas (each image can have specific
spatial information associated with it - all in the same file)
A multi or hyperspectral spatial dataset that contains hundreds of bands.
Field data for several sites characterizing insects, mammals, vegetation and
climate.
And much more!
Open Format
The HDF5 format is open and free to use. The supporting libraries (and a free
viewer), can be downloaded from the
HDF Group
website. As such, HDF5 is widely supported in a host of programs, including
open source programming languages like R and Python, and commercial
programming tools like Matlab and IDL. Spatial data that are stored in HDF5
format can be used in GIS and imaging programs including QGIS, ArcGIS, and
ENVI.
Summary Points - Benefits of HDF5
Self-Describing: The datasets within an HDF5 file are self-describing. This
allows us to efficiently extract metadata without needing an additional metadata
document.
Supports Heterogeneous Data: Different types of datasets can be contained
within one HDF5 file.
Supports Large, Complex Data: HDF5 is a compressed format that is designed
to support large, heterogeneous, and complex datasets.
Supports Data Slicing: "Data slicing", or extracting portions of the
dataset as needed for analysis, means large files don't need to be completely
read into the computer's memory (RAM).
Open Format - wide support in many tools: Because the HDF5 format is
open, it is supported by a host of programming languages and tools, including
open source languages like R and Python and open GIS tools like QGIS.
The LiDAR and imagery data used to create the rasters in this dataset were
collected over the San Joaquin field site located in California (NEON Domain 17)
and processed at NEON
headquarters. The entire dataset can be accessed by request from the NEON website.
This data download contains several files used in related tutorials. The path to
the files we will be using in this tutorial is:
NEON-DS-Field-Site-Spatial-Data/SJER/.
You should set your working directory to the parent directory of the downloaded
data to follow the code exactly.
Raster or "gridded" data are data that are saved in pixels. In the spatial world,
each pixel represents an area on the Earth's surface. For example in the raster
below, each pixel represents a particular land cover class that would be found in
that location in the real world.
The National Land Cover dataset (NLCD) is an example of a commonly used
raster dataset. Each pixel in the Landsat derived raster represents a land cover
class. Source: Multi-Resolution Land Characteristics Consortium.
To work with rasters in R, we need two key packages, sp and raster.
To install the raster package you can use install.packages('raster').
When you install the raster package, sp should also install. Also install the
rgdal package with install.packages('rgdal'). Among other things, rgdal will
allow us to export rasters to GeoTIFF format.
Once installed we can load the packages and start working with raster data.
# load the raster, sp, and rgdal packages
library(raster)
library(sp)
library(rgdal)
# set working directory to data folder
#setwd("pathToDirHere")
wd <- ("~/Git/data/")
setwd(wd)
Next, let's load a raster containing elevation data into our environment and
look at its attributes.
# load raster in an R object called 'DEM'
DEM <- raster(paste0(wd, "NEON-DS-Field-Site-Spatial-Data/SJER/DigitalTerrainModel/SJER2013_DTM.tif"))
# look at the raster attributes.
DEM
## class : RasterLayer
## dimensions : 5060, 4299, 21752940 (nrow, ncol, ncell)
## resolution : 1, 1 (x, y)
## extent : 254570, 258869, 4107302, 4112362 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=11 +datum=WGS84 +units=m +no_defs
## source : /Users/olearyd/Git/data/NEON-DS-Field-Site-Spatial-Data/SJER/DigitalTerrainModel/SJER2013_DTM.tif
## names : SJER2013_DTM
Notice a few things about this raster.
dimensions: the "size" of the file in pixels
nrow, ncol: the number of rows and columns in the data (imagine a spreadsheet or a matrix)
ncell: the total number of pixels or cells that make up the raster
resolution: the size of each pixel (in meters in this case). 1 meter pixels
means that each pixel represents a 1 m x 1 m area on the earth's surface.
extent: the spatial extent of the raster. This value will be in the same
coordinate units as the coordinate reference system of the raster.
crs: the coordinate reference system string for the raster. This
raster is in UTM (Universal Transverse Mercator) zone 11 with a datum of WGS 84.
More on UTM here.
Work with Rasters in R
Now that we have the raster loaded into R, let's grab some key raster attributes.
Define Min/Max Values
By default this raster doesn't have the min or max values associated with its
attributes. Let's change that by using the setMinMax() function.
# calculate and save the min and max values of the raster to the raster object
DEM <- setMinMax(DEM)
# view raster attributes
DEM
## class : RasterLayer
## dimensions : 5060, 4299, 21752940 (nrow, ncol, ncell)
## resolution : 1, 1 (x, y)
## extent : 254570, 258869, 4107302, 4112362 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=11 +datum=WGS84 +units=m +no_defs
## source : /Users/olearyd/Git/data/NEON-DS-Field-Site-Spatial-Data/SJER/DigitalTerrainModel/SJER2013_DTM.tif
## names : SJER2013_DTM
## values : 228.1, 518.66 (min, max)
Notice that values is now part of the attributes and shows the min and max values
for the pixels in the raster. What these min and max values represent depends on
what is represented by each pixel in the raster.
You can also view the raster's min and max values and the range of values contained
within the pixels.
#Get min and max cell values from raster
#NOTE: this code may fail if the raster is too large
cellStats(DEM, min)
## [1] 228.1
cellStats(DEM, max)
## [1] 518.66
cellStats(DEM, range)
## [1] 228.10 518.66
View CRS
First, let's consider the Coordinate Reference System (CRS).
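The raster package's crs() function returns the same proj4 string shown in the attribute summary above; a quick sketch:
# view the coordinate reference system (CRS) of the raster
crs(DEM)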
View Histogram of Raster Values
We can also create a histogram to view the distribution of values in our raster.
Note that the max number of pixels that R will plot by default is 100,000. We
can tell it to plot more using the maxpixels attribute. Be careful with this:
if your raster is large, plotting all pixels can take a long time or even crash
your program.
Since our raster is a digital elevation model, we know that each pixel contains
elevation data about our area of interest. In this case the units are meters.
This is an easy and quick data checking tool. Are there any totally weird values?
# the distribution of values in the raster
hist(DEM, main="Distribution of elevation values",
     col= "purple",
     maxpixels=22000000)
It looks like we have a lot of land around 325 m and 425 m.
Plot Raster Data
Let's take a look at our raster now that we know a bit more about it. We can do
a simple plot with the plot() function.
# plot the raster
# note that this raster represents a small region of the NEON SJER field site
plot(DEM,
main="Digital Elevation Model, SJER") # add title with main
R has an image() function that allows you to control the way a raster is
rendered on the screen. The plot() function in R has a base setting for the number
of pixels that it will plot (100,000 pixels). The image() command thus might be
better for rendering larger rasters.
# create a plot of our raster
image(DEM)
# specify the range of values that you want to plot in the DEM
# just plot pixels between 250 and 300 m in elevation
image(DEM, zlim=c(250,300))
# we can specify the colors too
col <- terrain.colors(5)
image(DEM, zlim=c(250,375), main="Digital Elevation Model (DEM)", col=col)
Plotting with Colors
In the above example, terrain.colors() tells R to create a palette of colors
from the terrain.colors color ramp. There are other palettes that you can
use as well, including rainbow() and heat.colors().
What happens if you change the number of colors in the terrain.colors() function?
What happens if you change the zlim values in the image() function?
What are the other attributes that you can specify when using the image() function?
Breaks and Colorbars in R
A digital elevation model (DEM) is an example of a continuous raster. It
contains elevation values for a range. For example, elevations values in a
DEM might include any set of values between 200 m and 500 m. Given this range,
you can plot DEM pixels using a gradient of colors.
By default, R will assign the gradient of colors uniformly across the full
range of values in the data. In our case, our DEM has values between roughly
228 and 519 m. However, we can adjust the "breaks", which represent the numeric
locations where the colors change, if we want to.
# add a color map with 5 colors
col=terrain.colors(5)
# add breaks to the colormap (6 breaks = 5 segments)
brk <- c(250, 300, 350, 400, 450, 500)
plot(DEM, col=col, breaks=brk, main="DEM with more breaks")
We can also customize the legend appearance.
# First, expand right side of clipping rectangle to make room for the legend
# turn xpd off
par(xpd = FALSE, mar=c(5.1, 4.1, 4.1, 4.5))
# Second, plot w/ no legend
plot(DEM, col=col, breaks=brk, main="DEM with a Custom (but flipped) Legend", legend = FALSE)
# Third, turn xpd back on to force the legend to fit next to the plot.
par(xpd = TRUE)
# Fourth, add a legend - & make it appear outside of the plot
legend(par()$usr[2], 4110600,
legend = c("lowest", "a bit higher", "middle ground", "higher yet", "highest"),
fill = col)
Notice that the legend is in reverse order in the previous plot. Let’s fix that.
We need to both reverse the order in which the legend is laid out and reverse
the color fill with rev(col).
# Expand right side of clipping rect to make room for the legend
par(xpd = FALSE,mar=c(5.1, 4.1, 4.1, 4.5))
#DEM with a custom legend
plot(DEM, col=col, breaks=brk, main="DEM with a Custom Legend",legend = FALSE)
#turn xpd back on to force the legend to fit next to the plot.
par(xpd = TRUE)
#add a legend - but make it appear outside of the plot
legend( par()$usr[2], 4110600,
legend = c("Highest", "Higher yet", "Middle","A bit higher", "Lowest"),
fill = rev(col))
Try the code again but only make one of the changes -- reverse order or reverse
colors -- what happens?
The raster plot now inverts the elevations! This is a good reason to understand
your data so that a simple visualization error doesn't have you reversing the
slope or some other error.
We can add a custom color map with fewer breaks as well.
#add a color map with 4 colors
col=terrain.colors(4)
#add breaks to the colormap (5 breaks = 4 segments)
brk <- c(200, 300, 350, 400, 500)
plot(DEM, col=col, breaks=brk, main="DEM with fewer breaks")
A discrete dataset has a set of unique categories or classes. One example could
be land use classes. The example below shows elevation zones generated using the
same DEM.
A DEM with discrete classes. In this case, the classes relate to regions of elevation values.
Basic Raster Math
We can also perform calculations on our raster. For instance, we could multiply
all values within the raster by 2.
#multiply each pixel in the raster by 2
DEM2 <- DEM * 2
DEM2
## class : RasterLayer
## dimensions : 5060, 4299, 21752940 (nrow, ncol, ncell)
## resolution : 1, 1 (x, y)
## extent : 254570, 258869, 4107302, 4112362 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=11 +datum=WGS84 +units=m +no_defs
## source : memory
## names : SJER2013_DTM
## values : 456.2, 1037.32 (min, max)
#plot the new DEM
plot(DEM2, main="DEM with all values doubled")
Cropping Rasters in R
You can crop rasters in R using different methods. You can crop the raster directly
by drawing a box in the plot area. To do this, first plot the raster. Then define
the crop extent by clicking twice:
Click in the UPPER LEFT hand corner where you want the crop
box to begin.
Click again in the LOWER RIGHT hand corner to define where the box ends.
You'll see a red box on the plot. NOTE that this is a manual process that can be
used to quickly define a crop extent.
#plot the DEM
plot(DEM)
#Define the extent of the crop by clicking on the plot
cropbox1 <- drawExtent()
#crop the raster, then plot the new cropped raster
DEMcrop1 <- crop(DEM, cropbox1)
#plot the cropped extent
plot(DEMcrop1)
You can also manually assign the extent coordinates to be used to crop a raster.
We'll need the extent defined as (xmin, xmax, ymin, ymax) to do this.
This is how we'd crop using a GIS shapefile (with a rectangular shape).
#define the crop extent
cropbox2 <-c(255077.3,257158.6,4109614,4110934)
#crop the raster
DEMcrop2 <- crop(DEM, cropbox2)
#plot cropped DEM
plot(DEMcrop2)
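Earlier we noted that rgdal lets us export rasters to GeoTIFF format. A minimal sketch using writeRaster() (the output filename here is hypothetical):
# export the cropped raster to a GeoTIFF file (example filename)
writeRaster(DEMcrop2, filename = paste0(wd, "SJER2013_DTM_crop.tif"),
            format = "GTiff", overwrite = TRUE)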
### Challenge: Plot a Digital Surface Model
Use what you've learned to open and plot a Digital Surface Model.
Create an R object called DSM from the raster: DigitalSurfaceModel/SJER2013_DSM.tif.
Convert the raster data from m to feet. What is that conversion again? Oh, right: 1 m = ~3.3 ft.
Plot the DSM in feet using a custom color map.
Create numeric breaks that make sense given the distribution of the data.
Hint, your breaks might represent high elevation, medium elevation,
low elevation.
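If you get stuck, here is a minimal sketch of one possible solution; the break values below are purely illustrative, so inspect your DSM's min and max first and pick breaks that match its distribution:
# open the Digital Surface Model and calculate its min/max values
DSM <- raster(paste0(wd, "NEON-DS-Field-Site-Spatial-Data/SJER/DigitalSurfaceModel/SJER2013_DSM.tif"))
DSM <- setMinMax(DSM)
# convert from meters to feet (1 m = ~3.28 ft)
DSM_ft <- DSM * 3.28
# plot with a custom color map; breaks are illustrative low/medium/high values
col <- terrain.colors(3)
brk <- c(700, 1100, 1500, 1900)
plot(DSM_ft, col=col, breaks=brk, main="Digital Surface Model, SJER (feet)")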
In this tutorial you will use the free HDFView tool to explore HDF5 files and
the groups and datasets contained within. You will also see how HDF5 files can
be structured and explore metadata using both spatial and temporal data stored
in HDF5!
Learning Objectives
After completing this activity, you will be able to:
Explain how data can be structured and stored in HDF5.
Navigate to metadata in an HDF5 file, making it "self describing".
Explore HDF5 files using the free HDFView application.
NOTE: The first file downloaded has an .HDF5 file extension, the second file
downloaded below has an .h5 extension. Both extensions represent the HDF5 data
type.
Select the HDFView download option that matches the operating system
(Mac OS X, Windows, or Linux) and computer setup (32 bit vs 64 bit) that you have.
This tutorial was written with graphics from the VS 2012 version, but it is
applicable to other versions as well.
Hierarchical Data Format 5 - HDF5
Hierarchical Data Format version 5 (HDF5) is an open file format that supports
large, complex, heterogeneous data. Some key points about HDF5:
HDF5 uses a "file directory" like structure.
The HDF5 data model organizes information using Groups. Each group may contain one or more datasets.
HDF5 is a self-describing file format. This means that the metadata for the
data contained within the HDF5 file are built into the file itself.
One HDF5 file may contain several heterogeneous data types (e.g. images,
numeric data, data stored as strings).
In this tutorial, we will explore two different types of data saved in HDF5.
This will allow us to better understand how one file can store multiple different
types of data, in different ways.
Part 1: Exploring Temperature Data in HDF5 Format in HDFView
The first thing that we will do is open an HDF5 file in the viewer to get a
better idea of how HDF5 files can be structured.
Open a HDF5/H5 file in HDFView
To begin, open the HDFView application.
Within the HDFView application, select File --> Open and navigate to the folder
where you saved the NEONDSTowerTemperatureData.hdf5 file on your computer. Open this file in HDFView.
If you click on the name of the HDF5 file in the left hand window of HDFView,
you can view metadata for the file. This will be located in the bottom window of
the application; you may have to click on the metadata tab at the bottom of the
viewer.
Explore File Structure in HDFView
Next, explore the structure of this file. Notice that there are two Groups
(represented as folder icons in the viewer) called "Domain_03" and "Domain_10".
Within each domain group, there are site groups (NEON sites that are located within
those domains). Expand these folders by double clicking on the folder icons.
Double clicking expands the group's content just as you might expand a folder
in Windows Explorer.
Notice that there is metadata associated with each group.
Within the OSBS group there are two more groups - Min_1 and Min_30. What data
are contained within these groups?
Expand the "min_1" group within the OSBS site in Domain_03. Notice that there
are five more nested groups named "Boom_1, 2, etc". A boom refers to an arm on a
tower, which sits at a particular height and to which are attached sensors for
collecting data on such variables as temperature, wind speed, precipitation,
etc. In this case, we are working with data collected using temperature sensors,
mounted on the tower booms.
**Note:** The data used in this activity were collected
by a temperature sensor mounted on a National Ecological Observatory Network (NEON) flux tower.
Read more about NEON towers here.
A NEON flux tower contains booms or arms that house sensors at varying heights along the tower.
Speaking of temperature - what type of sensor collected the data within the
Boom_1 folder at the Ordway Swisher site? HINT: check the metadata for that dataset.
Expand the "Boom_1" folder by double clicking it. Finally, we have arrived at
a dataset! Have a look at the metadata associated with the temperature dataset
within the boom_1 group. Notice that there is metadata describing each attribute
in the temperature dataset.
Double click on the temperature dataset to open it in a tabular format. Notice
that these data are temporal.
So this is one example of how an HDF5 file could be structured. This particular
file contains data from multiple sites, collected from different sensors (mounted
on different booms on the tower) and collected over time. Take some time to
explore this HDF5 dataset within the HDFViewer.
Part 2: Exploring Hyperspectral Imagery stored in HDF5
NEON airborne observation platform.
Next, we will explore a hyperspectral dataset, collected by the
NEON Airborne Observation Platform (AOP)
and saved in HDF5 format. Hyperspectral
data are naturally hierarchical, as each pixel in the dataset contains reflectance
values for hundreds of bands collected by the sensor. The NEON sensor
(imaging spectrometer) collected data within 428 bands.
A few notes about hyperspectral imagery:
An imaging spectrometer, which collects hyperspectral imagery, records light
energy reflected off objects on the earth's surface.
The data are inherently spatial. Each "pixel" in the image is located spatially
and represents an area of ground on the earth.
Similar to a Red, Green, Blue (RGB) camera, an imaging spectrometer records
reflected light energy. Each pixel will contain several hundred bands worth of
reflectance data.
A hyperspectral instrument records reflected light energy across
very narrow bands. The NEON Imaging Spectrometer collects 428 bands of
information for each pixel on the ground.
Let's open some hyperspectral imagery stored in HDF5 format to see what the file
structure can look like for a different type of data.
Open the file. Notice that it is structured differently. This file is composed
of 3 datasets:
Reflectance,
fwhm, and
wavelength.
It also contains some text information called "map info". Finally, it contains a
group called spatial info.
Let's first look at the metadata stored in the spatial info group. This group
contains all of the spatial information that a GIS program would need to project
the data spatially.
Next, double click on the wavelength dataset. Note that this dataset contains
the central wavelength value for each band in the dataset.
Finally, click on the reflectance dataset. Note in the metadata that the
structure of the dataset is 426 x 501 x 477 (wavelength, line, sample). Right
click on the reflectance dataset and select Open As. Click Image in the
"display as" settings on the left hand side of the popup.
In this case, the image data are in the second and third dimensions of this
dataset. However, HDFView will default to selecting the first and second
dimensions.
**Note:** HDF5 files use 0-based indexing. Therefore,
the first dimension is called `dim 0` and the second is called `dim 1`.
Let’s tell the HDFViewer to use the second and third dimensions to view the image:
Under height, make sure dim 1 is selected.
Under width, make sure dim 2 is selected.
Notice an image preview appears on the left of the pop-up window. Click OK to open
the image. You may have to play with the brightness and contrast settings in the
viewer to see the data properly.
Explore the spectral dataset in the HDFViewer taking note of the metadata and
data stored within the file.
Consider reviewing the documentation for the rhdf5 package.
A Brief Review - About HDF5
The HDF5 file can store large, heterogeneous datasets that include metadata. It
also supports efficient data slicing, or extraction of particular subsets of a
dataset, which means that you don't have to read large files into the computer's
memory (RAM) in their entirety in order to work with them. This saves a lot of
time when working with HDF5 data in R. When HDF5 files contain spatial data,
they can also be read directly into GIS programs such as QGIS.
The HDF5 format is a self-contained directory structure. We can compare this
structure to the folders and files located on your computer. However, in HDF5
files "directories" are called groups and files are called datasets. The
HDF5 container itself is a file. Each element in an HDF5 file can have metadata
attached to it, making HDF5 files "self-describing".
The package we'll be using is rhdf5 which is part of the
Bioconductor suite of
R packages. If you haven't installed this package before, you can use the first
two lines of code below to install the package. Then use library("rhdf5") to
load the package.
# Install rhdf5 package (only need to run if not already installed)
#install.packages("BiocManager")
#BiocManager::install("rhdf5")
# Call the R HDF5 Library
library("rhdf5")
# set working directory to ensure R can find the file we wish to import and where
# we want to save our files
wd <- "~/Documents/data/" #This will depend on your local environment
setwd(wd)
Let's start by outlining the structure of the file that we want to create.
We'll build a file called "sensorData.h5", that will hold data for a set of
sensors at three different locations. Each sensor takes three replicates of two
different measurements, every minute.
HDF5 allows us to organize and store data in many ways. Therefore we need to decide
what type of structure is ideally suited to our data before creating the HDF5 file.
To structure the HDF5 file, we'll start at the file level. We will create a group
for each sensor location. Within each location group, we will create two datasets
containing temperature and precipitation data collected through time at each location.
So it will look something like this:
HDF5 FILE (sensorData.H5)
    Location_One (Group)
        Temperature (Dataset)
        Precipitation (Dataset)
    Location_Two (Group)
        Temperature (Dataset)
        Precipitation (Dataset)
    Location_Three (Group)
        Temperature (Dataset)
        Precipitation (Dataset)
Let's first create the HDF5 file and call it "sensorData.h5". Next, we will add
a group for each location to the file.
# create hdf5 file
h5createFile("sensorData.h5")
## file '/Users/olearyd/Git/data/sensorData.h5' already exists.
## [1] FALSE
# create group for location 1
h5createGroup("sensorData.h5", "location1")
## Can not create group. Object with name 'location1' already exists.
## [1] FALSE
The output is TRUE when the code properly runs. (In the output shown above, the
file and group already existed from a previous run, so FALSE is returned instead.)
Remember from the discussion above that we want to create three location groups. The
process of creating nested groups can be simplified with loops and nested loops.
While the for loop below might seem excessive for adding three groups, it will
become increasingly efficient as we need to add more groups to our file.
# Create loops that will populate 2 additional location "groups" in our HDF5 file
l1 <- c("location2","location3")
for(i in 1:length(l1)){
    h5createGroup("sensorData.h5", l1[i])
}
## Can not create group. Object with name 'location2' already exists.
## Can not create group. Object with name 'location3' already exists.
Now let's view the structure of our HDF5 file. We'll use the h5ls() function to do this.
# View HDF5 File Structure
h5ls("sensorData.h5")
## group name otype dclass dim
## 0 / location1 H5I_GROUP
## 1 /location1 precip H5I_DATASET FLOAT 100 x 3
## 2 /location1 temp H5I_DATASET FLOAT 100 x 3
## 3 / location2 H5I_GROUP
## 4 /location2 precip H5I_DATASET FLOAT 100 x 3
## 5 /location2 temp H5I_DATASET FLOAT 100 x 3
## 6 / location3 H5I_GROUP
## 7 /location3 precip H5I_DATASET FLOAT 100 x 3
## 8 /location3 temp H5I_DATASET FLOAT 100 x 3
Our group structure that will contain location information is now set up. However,
it doesn't contain any data. Let's simulate some data, pretending that each sensor
took replicate measurements for 100 minutes. We'll add a 100 x 3 matrix that will
be stored as a dataset in our HDF5 file. We'll populate this dataset with
simulated data for each of our groups. We'll use loops to create these matrices
and then paste them into each location group within the HDF5 file as datasets.
# Add datasets to each group
for(i in 1:3){
    g <- paste("location", i, sep="")
    # create a precip dataset within each location group,
    # populated with a matrix of dummy data
    h5write(
        matrix(rnorm(300,2,1),
               ncol=3, nrow=100),
        file = "sensorData.h5",
        paste(g,"precip",sep="/"))
    # create a temperature dataset within each location group
    h5write(
        matrix(rnorm(300,25,5),
               ncol=3, nrow=100),
        file = "sensorData.h5",
        paste(g,"temp",sep="/"))
}
Understanding Complex Code
Sometimes you may run into code (like the above code) that combines multiple
functions. It can be helpful to break the pieces of the code apart to understand
their overall function.
Looking at the first h5write() chunk above, let's figure out what it is doing.
We can see easily that part of it is telling R to create a matrix (matrix())
that has 3 columns (ncol=3) and 100 rows (nrow=100). That is fairly
straightforward, but what about the rest?
Do the following to figure out what it's doing.
Type paste(g,"precip",sep="/") into the R console. What is the result?
Type rnorm(300,2,1) into the console and see the result.
Type g into the console and take note of the result.
The rnorm function creates a set of random numbers that fall into a normal
distribution. You specify the mean and standard deviation of the dataset and R
does the rest. Notice in this loop we are creating a "precip" and a "temp" dataset
and pasting them into each location group (the loop iterates 3 times).
The h5write function is writing each matrix to a dataset in our HDF5 file
(sensorData.h5). It is looking for the following arguments: h5write(dataset, YourHdfFileName, LocationOfDatasetInH5File). Therefore, the code:
h5write(matrix(rnorm(300,2,1), ncol=3, nrow=100), file = "sensorData.h5", paste(g,"precip",sep="/"))
tells R to add a random matrix of values to the sensorData HDF5 file within the
path called g. It also tells R to name the dataset within that group "precip".
HDF5 File Structure
Next, let's check the file structure of the sensorData.h5 file. The h5ls()
command tells us what each element in the file is, group or dataset. It also
identifies the dimensions and types of data stored within the datasets in the
HDF5 file. In this case, the precipitation and temperature datasets are of type
'float' and of dimensions 100 x 3 (100 rows by 3 columns).
# List file structure
h5ls("sensorData.h5")
## group name otype dclass dim
## 0 / location1 H5I_GROUP
## 1 /location1 precip H5I_DATASET FLOAT 100 x 3
## 2 /location1 temp H5I_DATASET FLOAT 100 x 3
## 3 / location2 H5I_GROUP
## 4 /location2 precip H5I_DATASET FLOAT 100 x 3
## 5 /location2 temp H5I_DATASET FLOAT 100 x 3
## 6 / location3 H5I_GROUP
## 7 /location3 precip H5I_DATASET FLOAT 100 x 3
## 8 /location3 temp H5I_DATASET FLOAT 100 x 3
Data Types within HDF5
HDF5 files can hold mixed types of data. For example, HDF5 files can store both
strings and numbers in the same file. Each dataset in an HDF5 file can be its
own type: a dataset can be composed of all integer values, or it could be
composed of all strings (characters). A group can contain a mix of string-
and number-based datasets. A single dataset can even be mixed itself,
containing a combination of numbers and strings.
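As a minimal sketch of that last point, rhdf5 writes an R data.frame (mixed character and numeric columns) as a single compound dataset; the dataset name below is hypothetical:
# a data.frame mixes types: character site names plus numeric temperatures
mixed <- data.frame(site = c("location1", "location2", "location3"),
                    meanTemp = c(25.1, 18.3, 22.7))
# stored as one compound dataset holding both strings and numbers
h5write(mixed, file = "sensorData.h5", "mixed_example")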
Add Metadata to HDF5 Files
Some metadata can be added to an HDF5 file in R by creating attributes in R
objects before adding them to the HDF5 file. Let's look at an example of how we
do this. We'll add the units of our data as an attribute of the R matrix before
adding it to the HDF5 file. Note that write.attributes = TRUE is needed when
you write to the HDF5 file, in order to add metadata to the dataset.
# Create matrix of "dummy" data
p1 <- matrix(rnorm(300,2,1),ncol=3,nrow=100)
# Add attribute to the matrix (units)
attr(p1,"units") <- "millimeters"
# Write the R matrix to the HDF5 file
h5write(p1,file = "sensorData.h5","location1/precip",write.attributes=T)
# Close the HDF5 file
H5close()
We close the file at the end once we are done with it. Otherwise, the next time
you open an HDF5 file, you will get a warning message similar to:
Warning message: In h5checktypeOrOpenLoc(file, readonly = TRUE) : An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'H5close()' to close all open HDF5 object handles.
Reading Data from an HDF5 File
We just learned how to create an HDF5 file and write information to the file.
We use a different set of functions to read data from an HDF5 file. If
read.attributes is set to TRUE when we read the data, then we can also see
the metadata for the matrix. Furthermore, we can choose to read in a subset,
like the first 10 rows of data, rather than loading the entire dataset into R.
# Read in all data contained in the precipitation dataset
l1p1 <- h5read("sensorData.h5","location1/precip",
read.attributes=T)
# Read in first 10 lines of the data contained within the precipitation dataset
l1p1s <- h5read("sensorData.h5","location1/precip",
read.attributes = T,index = list(1:10,NULL))
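Because we attached a "units" attribute to this dataset earlier, it should come back with the data when read.attributes is TRUE; a quick check:
# the "units" attribute written earlier should now be visible
attr(l1p1, "units")
## [1] "millimeters"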
Now you are ready to go onto the other tutorials in the series to explore more
about HDF5 files.
### Challenge: Your Own HDF5
Think about an application for HDF5 that you might have. Create a new HDF5 file
that would support the data that you need to store.
### Challenge: View Data with HDFView
Open the sensorData.h5 file in the HDFView application and explore the contents.
R is a versatile, open source programming language that was specifically
designed for data analysis. R is extremely useful for data management,
statistics and analyzing data.
This tutorial should be seen more as a reference on the basics of R than as a
tutorial for learning to use R. Here we define many of the basics; however, this
can get overwhelming if you are brand new to R.
Learning Objectives
After completing this tutorial, you will be able to:
Use basic R syntax
Explain the concepts of objects and assignment
Explain the concepts of vector and data types
Describe why you would or would not use factors
Use a few basic functions
Things You’ll Need To Complete This Tutorial
You will need the most current version of R and, preferably, RStudio loaded
on your computer to complete this tutorial.
Set Working Directory: This lesson assumes that you have set your working
directory to the location of the downloaded and unzipped data subsets.
R Script & Challenge Code: NEON data lessons often contain challenges that
reinforce learned skills. If available, the code for challenge solutions is
found in the downloadable R script of the entire lesson, available in the footer
of each lesson page.
The Very Basics of R
R is a versatile, open source programming language that was specifically
designed for data analysis. R is extremely useful for data management,
statistics and analyzing data.
**Cool Fact:** R was inspired by the programming language S.
A good alternative to commercial analysis tools. R has over 5,000 user
contributed packages (as of 2014) and is widely used both in academia and
industry.
Available on all platforms.
Not just for statistics, but also general purpose programming.
Supported by a large and growing community of peers.
Introduction to R
You can use R alone or with a user interface like RStudio to write your code.
Some people prefer RStudio as it provides a graphic interface where you can see
what objects have been created and you can also set variables like your working
directory, using menu options.
We want to use R to create code and workflows that are more reproducible. We can
document everything that we do. Our end goal is not just to "do stuff" but to
do it in a way that anyone can easily and exactly replicate our workflow and
results -- this includes ourselves in 3 months when the paper reviews come back!
Code & Comments in R
Everything you type into an R script is code, unless you demark it otherwise.
Anything to the right of a # is ignored by R. Use these comments within the
code to describe what it is that your code is doing. Comment liberally in your R
scripts. This will help you when you return to it and will also help others
understand your scripts and analyses.
# this is a comment. It allows text that is ignored by the program.
# for clean, easy to read comments, use a space between the # and text.
# there is a line of code below this comment
a <- 1 + 2
Basic Operations in R
Let's take a few moments to play with R. You can get output from R simply by
typing in math
or by typing words, with the command writeLines(). Words that you want to be
recognized as text (as opposed to a field name or other text that signifies an
object) must be enclosed within quotes.
# have R write words
writeLines("Hello World")
## Hello World
We can assign our results to an object and name the object. Object names
cannot contain spaces.
# assigning values to objects
secondsPerHour <- 60 * 60
hoursPerYear <- 365 * 24
# object names can't contain spaces. Use a period, underscore, or camelCase to
# create longer names
temp_HARV <- 90
par.OSBS <- 180
We can then return the value of an object we created.
The result of the operation on the right hand side of <- is assigned to
an object with the name specified on the left hand side of <-. The result
could be any type of R object, including your own functions (see the
Build & Work With Functions in R tutorial).
Assignment Operator: Drop the Equals Sign
The assignment operator is <-. It assigns values on the right to objects on
the left. It is similar to = but there are some subtle differences. Learn to
use <- as it is good programming practice. Using = in place of <- can lead
to issues down the line.
# this is preferred syntax
a <- 1 + 2
# this is NOT preferred syntax
a = 1 + 2
**Typing Tip:** If you are using RStudio, you can use
a keyboard shortcut for the assignment operator: **Windows/Linux: "Alt" + "-"**
or **Mac: "Option" + "-"**.
List All Objects in the Environment
Some functions are the same as in other languages. These might be familiar from
the command line.
ls(): to list objects in your current environment.
rm(): remove objects from your current environment.
Now try them in the console.
# assign value "5" to object "x"
x <- 5
ls()
# remove x
rm(x)
# what is left?
ls()
# remove all objects
rm(list = ls())
ls()
Using rm(list = ls()), you combine several functions to remove all objects.
If you typed x in the console now, you would get Error: object 'x' not found.
Data Types and Structures
To make the best of the R language, you'll need a strong understanding of the
basic data types and data structures and how to operate on those. These are the
objects you will manipulate on a day-to-day basis in R. Dealing with object
conversions is one of the most common sources of frustration for beginners.
First, everything in R is an object. But there are different types of
objects. One of the basic differences is in the data structures, which are
different ways data are stored.
R has many different data structures. These include
atomic vector
list
matrix
data frame
array
These data structures vary by the dimensionality of the data and by whether they
can handle data elements of a single type (homogeneous) or multiple types
(heterogeneous):
| Dimensions | Homogeneous   | Heterogeneous |
|------------|---------------|---------------|
| 1-D        | atomic vector | list          |
| 2-D        | matrix        | data frame    |
| none       | array         |               |
Vectors
A vector is the most common and basic data structure in R and is the workhorse
of R. Technically, vectors can be one of two types:
atomic vectors
lists
although the term "vector" most commonly refers to the atomic types not to lists.
Atomic Vectors
R has 6 atomic vector types.
character
numeric (real or decimal)
integer
logical
complex
raw (not discussed in this tutorial)
By atomic, we mean the vector only holds data of a single type.
character: "a", "swc"
numeric: 2, 15.5
integer: 2L (the L tells R to store this as an integer)
logical: TRUE, FALSE
complex: 1+4i (complex numbers with real and imaginary parts)
R provides many functions to examine features of vectors and other objects, for
example
typeof() - what is it?
length() - how long is it? What about two dimensional objects?
attributes() - does it have any metadata?
Let's look at some examples:
# assign word "april" to x"
x <- "april"
# return the type of the object
class(x)
## [1] "character"
# does x have any attributes?
attributes(x)
## NULL
# assign all integers 1 to 10 as an atomic vector to the object y
y <- 1:10
y
## [1] 1 2 3 4 5 6 7 8 9 10
class(y)
## [1] "integer"
# how many values does the vector y contain?
length(y)
## [1] 10
# coerce the integer vector y to a numeric vector
# store the result in the object z
z <- as.numeric(y)
z
## [1] 1 2 3 4 5 6 7 8 9 10
class(z)
## [1] "numeric"
A vector is a collection of elements that are most commonly character,
logical, integer or numeric.
You can create an empty vector with vector(). (By default the mode is
logical. You can be more explicit as shown in the examples below.) It is more
common to use direct constructors such as character(), numeric(), etc.
x <- vector()
# Create vector with a length and type
vector("character", length = 10)
## [1] "" "" "" "" "" "" "" "" "" ""
# create character vector with length of 5
character(5)
## [1] "" "" "" "" ""
# numeric vector length=5
numeric(5)
## [1] 0 0 0 0 0
# logical vector length=5
logical(5)
## [1] FALSE FALSE FALSE FALSE FALSE
# create a list or vector with combine `c()`
# this is the function used to create vectors and lists most of the time
x <- c(1, 2, 3)
x
## [1] 1 2 3
length(x)
## [1] 3
class(x)
## [1] "numeric"
x is a numeric vector. These are the most common kind. They are numeric
objects and are treated as double precision real numbers (they can store
decimal points). To explicitly create integers (no decimal points), add an
L to each (or coerce to the integer type using as.integer()).
# a numeric vector with integers (L)
x1 <- c(1L, 2L, 3L)
x1
## [1] 1 2 3
class(x1)
## [1] "integer"
# or using as.integer()
x2 <- as.integer(x)
class(x2)
## [1] "integer"
You can also have logical vectors.
# logical vector
y <- c(TRUE, TRUE, FALSE, FALSE)
y
## [1] TRUE TRUE FALSE FALSE
class(y)
## [1] "logical"
Finally, you can have character vectors.
# character vector
z <- c("Sarah", "Tracy", "Jon")
z
## [1] "Sarah" "Tracy" "Jon"
# what class is it?
class(z)
## [1] "character"
#how many elements does it contain?
length(z)
## [1] 3
# what is the structure?
str(z)
## chr [1:3] "Sarah" "Tracy" "Jon"
You can also add elements to a list or vector with the c() function.
# c function combines z and "Annette" into a single vector
# store result back to z
z <- c(z, "Annette")
z
## [1] "Sarah" "Tracy" "Jon" "Annette"
More examples of how to create vectors
x <- c(0.5, 0.7)
x <- c(TRUE, FALSE)
x <- c("a", "b", "c", "d", "e")
x <- 9:100
x <- c(1 + (0 + 0i), 2 + (0 + 4i))
You can also create vectors as a sequence of numbers.
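For example, we can build sequence vectors with the : operator or with seq():
# create a sequence of integers with the : operator
series <- 1:10
# seq() allows a custom step size
seq(from = 0, to = 1, by = 0.25)
## [1] 0.00 0.25 0.50 0.75 1.00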
Inf is infinity. You can have either positive or negative infinity.
NaN means Not a Number. It's an undefined value.
Try it out in the console.
# infinity return
1/0
## [1] Inf
# non numeric return
0/0
## [1] NaN
Indexing
Vectors have positions; these positions are ordered and can be called using
object[index]
# index
z[2]
## [1] "Tracy"
# to call multiple items (a subset of our data), we can put a vector of which
# items we want in the brackets
group1 <- c(1, 4)
z[group1]
## [1] "Sarah" "Annette"
# this is especially useful with a sequence vector
z[1:3]
## [1] "Sarah" "Tracy" "Jon"
Objects can have attributes. Attributes are part of the object. These
include:
names: the field or variable names within the object
dimnames: the names of the dimensions (e.g., row and column names)
dim: the dimensions of the object
class: the class of the object
attributes: other metadata attached to the object
You can also glean other attribute-like information such as length()
(works on vectors and lists) or number of characters nchar() (for
character strings).
# length of an object
length(1:10)
## [1] 10
length(x)
## [1] 3
# number of characters in a text string
nchar("NEON Data Skills")
## [1] 16
Heterogeneous Data - Mixing Types?
When you mix types, R will create a resulting vector that is the least common
denominator. The coercion will move towards the one that's easiest to coerce
to.
Guess what the following do:
m <- c(1.7, "a")
n <- c(TRUE, 2)
o <- c("a", TRUE)
Were you correct?
n <- c(1.7, "a")
n
## [1] "1.7" "a"
o <- c(TRUE, 2)
o
## [1] 1 2
p <- c("a", TRUE)
p
## [1] "a" "TRUE"
This is called implicit coercion. You can also coerce vectors explicitly using
the as.<class_name>() functions.
# making values numeric
as.numeric("1")
## [1] 1
# make values character
as.character(1)
## [1] "1"
# make values a factor
as.factor(c("male", "female"))
## [1] male female
## Levels: female male
Matrix
In R, matrices are an extension of the numeric or character vectors. They are
not a separate type of object but simply an atomic vector with dimensions; the
number of rows and columns.
# create an empty matrix that is 2x2
m <- matrix(nrow = 2, ncol = 2)
m
## [,1] [,2]
## [1,] NA NA
## [2,] NA NA
# what are the dimensions of m
dim(m)
## [1] 2 2
Matrices in R are by default filled column-wise. You can also use the byrow
argument to specify how the matrix is filled.
# create a matrix. Notice R fills them by columns by default
m2 <- matrix(1:6, nrow = 2, ncol = 3)
m2
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
# set the byrow argument to TRUE to fill by rows
m2_row <- matrix(c(1:6), nrow = 2, ncol = 3, byrow = TRUE)
m2_row
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
Assigning dimensions with dim() transforms a vector into a matrix; below we turn
a vector of 10 values into a matrix with 2 rows and 5 columns.
Another way to shape your matrix is to bind columns with cbind() or rows with rbind().
# create vector with 1:10
m3 <- 1:10
m3
## [1] 1 2 3 4 5 6 7 8 9 10
class(m3)
## [1] "integer"
# set the dimensions so it becomes a matrix
dim(m3) <- c(2, 5)
m3
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
class(m3)
## [1] "matrix" "array"
# create matrix from two vectors
x <- 1:3
y <- 10:12
# cbind will bind the two by column
cbind(x, y)
## x y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
# rbind will bind the two by row
rbind(x, y)
## [,1] [,2] [,3]
## x 1 2 3
## y 10 11 12
Matrix Indexing
We can call elements of a matrix with square brackets just like a vector, except
now we must specify a row and a column.
z <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 3, ncol = 2)
z
## [,1] [,2]
## [1,] "a" "d"
## [2,] "b" "e"
## [3,] "c" "f"
# call element in the third row, second column
z[3, 2]
## [1] "f"
# leaving the row blank will return contents of the whole column
# note: the column's contents are displayed as a vector (horizontally)
z[, 2]
## [1] "d" "e" "f"
class(z[, 2])
## [1] "character"
# return the contents of the second row
z[2, ]
## [1] "b" "e"
List
In R, lists act as containers. Unlike atomic vectors, the contents of a list are
not restricted to a single mode and can encompass any mixture of data types.
Lists are sometimes called generic vectors, because the elements of a list can
be of any type of R object, even lists containing further lists. This property
makes them fundamentally different from atomic vectors.
A list is different from an atomic vector because each element can be a
different type -- it can contain heterogeneous data types.
Create lists using list() or coerce other objects using as.list(). An empty
list of the required length can be created using vector().
x <- list(1, "a", TRUE, 1 + (0 + 4i))
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
class(x)
## [1] "list"
x <- vector("list", length = 5) ## empty list
length(x)
## [1] 5
#call the 1st element of list x
x[[1]]
## NULL
x <- 1:10
x <- as.list(x)
Questions:
What is the class of x[1]?
What about x[[1]]?
Try it out.
We can also give the elements of our list names, then call those elements with
the $ operator.
# note 'iris' is an example data frame included with R
# the head() function simply calls the first 6 rows of the data frame
xlist <- list(a = "Karthik Ram", b = 1:10, data = head(iris))
xlist
## $a
## [1] "Karthik Ram"
##
## $b
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# see names of our list elements
names(xlist)
## [1] "a" "b" "data"
# call individual elements by name
xlist$a
## [1] "Karthik Ram"
xlist$b
## [1] 1 2 3 4 5 6 7 8 9 10
xlist$data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
What is the length of this object? What about its structure?
Lists can be extremely useful inside functions. You can “staple” together
lots of different kinds of results into a single object that a function can
return.
A list does not print to the console like a vector. Instead, each element
of the list starts on a new line.
Elements are indexed by double brackets. Single brackets will still return
a(nother) list.
Factors
Factors are special vectors that represent categorical data. Factors can be
ordered or unordered and are important for modelling functions such as lm()
and glm() and also in plot() methods. Once created, factors can only contain
a pre-defined set of values, known as levels.
Factors are stored as integers that have labels associated with those unique integers.
While factors look (and often behave) like character vectors, they are actually
integers under the hood. You need to be careful when treating them like strings.
Some string methods will coerce factors to strings, while others will throw an
error.
Sometimes factors can be left unordered. Example: male, female.
Other times you might want factors to be ordered (or ranked). Example: low,
medium, high.
Underneath, these are represented by the numbers 1, 2, 3.
They are better than using simple integer labels because factors are what are
called self-describing: "male" and "female" are more descriptive than 1s and 2s.
This is helpful when there is no additional metadata. (Which is male, 1 or 2?
You wouldn't be able to tell with just integer data; factors have this
information built in.)
Factors can be created with factor(). Input is often a character vector.
x <- factor(c("yes", "no", "no", "yes", "yes"))
x
## [1] yes no no yes yes
## Levels: no yes
table(x) will return a frequency table counting the number of elements in
each level.
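For the factor x created above (three "yes" values and two "no" values), that looks like:
# frequency of each level in x
table(x)
## x
##  no yes
##   2   3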
If you need to convert a factor to a character vector, simply use as.character(x).
To see the integer version of the factor levels, use as.numeric():
as.numeric(x)
## [1] 2 1 1 2 2
To convert a factor to a numeric vector, go via a character. Compare
fac <- factor(c(1, 5, 5, 10, 2, 2, 2))
levels(fac) ## returns just the four levels present in our factor
## [1] "1" "2" "5" "10"
as.numeric(fac) ## wrong! returns the assigned integer for each level
## [1] 1 3 3 4 2 2 2
## the integer corresponds to the position of that number in levels(fac)
as.character(fac) ## returns a character string of each number
## [1] "1" "5" "5" "10" "2" "2" "2"
as.numeric(as.character(fac)) ## coerce the character strings to numbers
## [1] 1 5 5 10 2 2 2
In modeling functions, it is important to know what the 'baseline' level is. This
is the first level of the factor; by default, the ordering is determined by the
alphanumerical order of the elements. You can change this by specifying the levels
(another option is to use the function relevel()).
# the default result (because N comes before Y alphabetically)
x <- factor(c("yes", "no", "yes"))
x
## [1] yes no yes
## Levels: no yes
# now let's try again, this time specifying the order of our levels
x <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))
x
## [1] yes no yes
## Levels: yes no
Data Frames
A data frame is a very important data type in R. It's pretty much the de facto
data structure for most tabular data and what we use for statistics.
A data frame is a special type of list where every element of the list has the
same length (i.e., every column has the same number of rows).
Data frames can have additional attributes such as rownames(), which can
be useful for annotating data, like subject_id or sample_id. But most of
the time they are not used.
Some additional information on data frames:
Usually created by read.csv() and read.table().
Can convert to matrix with data.matrix() (preferred) or as.matrix();
coercion will be forced and may not always be what you expect.
Can also create with data.frame() function.
Find the number of rows and columns with nrow(dat) and ncol(dat),
respectively.
Rownames are usually 1, 2, ..., n.
Manually Create Data Frames
You can manually create a data frame using data.frame.
# create a dataframe
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat
## id x y
## 1 a 1 11
## 2 b 2 12
## 3 c 3 13
## 4 d 4 14
## 5 e 5 15
## 6 f 6 16
## 7 g 7 17
## 8 h 8 18
## 9 i 9 19
## 10 j 10 20
Useful Data Frame Functions
head() - shows the first 6 rows
tail() - shows the last 6 rows
dim() - returns the dimensions
nrow() - number of rows
ncol() - number of columns
str() - structure of each column
names() - shows the names attribute for a data frame, which gives the
column names.
Instead of a list of single items, a data frame is a list of vectors!
# see the class of a single variable column within iris: "Sepal.Length"
class(iris$Sepal.Length)
## [1] "numeric"
A recap of the different data types:

| Dimensions | Homogeneous   | Heterogeneous |
|------------|---------------|---------------|
| 1-D        | atomic vector | list          |
| 2-D        | matrix        | data frame    |
| none       | array         |               |
Functions
A function is an R object that takes inputs to perform a task.
Functions take in information and may return desired outputs.
output <- name_of_function(inputs)
# create a list of 1 to 10
x <- 1:10
# the sum of all x
y <- sum(x)
y
## [1] 55
Help
All functions come with a help screen. It is critical that you learn to read the
help screens since they provide important information on what the function does,
how it works, and usually example code at the very bottom. You can pull up help
with help(function-name) or, more simply, ?function-name; ??search-term searches
the help system for a term.
# call up a help search
help.start()
# search the help files for a package
??ggplot2
# search help for a function
??sum
You can't ever learn all of R as it is ever changing with new packages and new
tools, but once you have the basics and know how to find help to do the things
that you want to do, you'll be able to use R in your science.
Sample Data
R comes with sample datasets. You will often find these as the datasets used in
documentation files or in responses to inquiries on public forums like StackOverflow.
To see all available sample datasets you can type in data() to the console.
Packages in R
R comes with a set of functions or commands that perform particular sets of
calculations. For example, in the equation 1+2, R knows that the "+" means to
add the two numbers, 1 and 2 together. However, you can expand the capability of
R by installing packages that contain suites of functions and compiled code that
you can also use in your code.
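For example, to install and load the ggplot2 plotting package (any CRAN package works the same way; ggplot2 is just an illustration):
# install a package from CRAN (only needed once per machine)
install.packages("ggplot2")
# load the package into the current R session (needed each session)
library(ggplot2)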
The LiDAR and imagery data used to create the rasters in this dataset were
collected over the San Joaquin field site located in California (NEON Domain 17)
and processed at NEON
headquarters. The entire dataset can be accessed by request from the NEON website.
This data download contains several files used in related tutorials. The path to
the files we will be using in this tutorial is:
NEON-DS-Field-Site-Spatial-Data/SJER/.
You should set your working directory to the parent directory of the downloaded
data to follow the code exactly.
This tutorial will overview the key attributes of a raster object, including
spatial extent, resolution and coordinate reference system. When working within
a GIS system, these attributes are often handled for you. However, it is important
to be familiar with them when working in non-GUI environments such as
R or even Python.
In order to correctly spatially reference a raster that is not already georeferenced,
you will also need to identify:
The lower left hand corner coordinates of the raster.
The number of columns and rows that the raster dataset contains.
Spatial Resolution
A raster consists of a series of pixels, each with the same dimensions
and shape. In the case of rasters derived from airborne sensors, each pixel
represents an area of space on the Earth's surface. The size of the area on the
surface that each pixel covers is known as the spatial resolution of the image.
For instance, an image that has a 1 m spatial resolution means that each pixel in
the image represents a 1 m x 1 m area.
The spatial resolution of a raster refers to the size of each cell
in meters. This size in turn relates to the area on the ground that the pixel
represents. Source: National Ecological Observatory Network (NEON)
A raster at the same extent with more pixels will have a higher
resolution (it looks more "crisp"). A raster that is stretched over the same
extent with fewer pixels will look more blurry and will be of lower resolution.
Source: National Ecological Observatory Network (NEON)
Load the Data
Let's open up a raster in R to see how the attributes are stored. We are
going to work with a Digital Terrain Model from the San Joaquin Experimental
Range in California.
# load packages
library(raster)
library(rgdal)
# set working directory to data folder
#setwd("pathToDirHere")
wd <- ("~/Git/data/")
setwd(wd)
# Load raster in an R object called 'DEM'
DEM <- raster(paste0(wd, "NEON-DS-Field-Site-Spatial-Data/SJER/DigitalTerrainModel/SJER2013_DTM.tif"))
# View raster attributes
DEM
## class : RasterLayer
## dimensions : 5060, 4299, 21752940 (nrow, ncol, ncell)
## resolution : 1, 1 (x, y)
## extent : 254570, 258869, 4107302, 4112362 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=11 +datum=WGS84 +units=m +no_defs
## source : /Users/olearyd/Git/data/NEON-DS-Field-Site-Spatial-Data/SJER/DigitalTerrainModel/SJER2013_DTM.tif
## names : SJER2013_DTM
Note that this raster (in GeoTIFF format) already has an extent, resolution, and
CRS defined. The resolution in both x and y directions is 1. The CRS tells us
that the x,y units of the data are meters (m).
Spatial Extent
The spatial extent of a raster represents the "X, Y" coordinates of the corners
of the raster in geographic space. This information, in addition to the cell
size or spatial resolution, tells the program how to place or render each pixel
in 2-dimensional space. Tools like R, with supporting packages such as rgdal
and raster, have functions that allow you to view and define the
extent of a new raster.
# View the extent of the raster
DEM@extent
## class : Extent
## xmin : 254570
## xmax : 258869
## ymin : 4107302
## ymax : 4112362
If you double the extent value of a raster, the pixels will be
stretched over the larger area, making it look more "blurry". Source: National
Ecological Observatory Network (NEON)
Calculating Raster Extent
Extent and spatial resolution are closely connected. To calculate the extent of a
raster, we first need the bottom left hand (X,Y) coordinate of the raster. In
the case of the UTM coordinate system, which is in meters, we can multiply the
number of columns and rows by the resolution (the pixel size) and add the
results to the X,Y corner coordinate to get the raster's maximum X and Y bounds.
To be located geographically, a raster's location needs to be
defined in geographic space (i.e., on a spatial grid). The spatial extent
defines the four corners of a raster within a given coordinate reference
system. Source: National Ecological Observatory Network.
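As a quick check of this calculation, we can recompute the extent of the DEM we
loaded above from its corner coordinate, dimensions, and resolution (a sketch
using the raster package accessor functions):
# lower left corner of the DEM (from its extent)
xMin <- 254570
yMin <- 4107302
# xmax = xmin + (number of columns * x resolution): 254570 + 4299 * 1 = 258869
xMax <- xMin + (ncol(DEM) * xres(DEM))
# ymax = ymin + (number of rows * y resolution): 4107302 + 5060 * 1 = 4112362
yMax <- yMin + (nrow(DEM) * yres(DEM))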
Let's explore that next, using a blank raster that we create.
# create a raster from the matrix - a "blank" raster of 4x4
myRaster1 <- raster(nrow=4, ncol=4)
# assign "data" to raster: 1 to n based on the number of cells in the raster
myRaster1[]<- 1:ncell(myRaster1)
# view attributes of the raster
myRaster1
## class : RasterLayer
## dimensions : 4, 4, 16 (nrow, ncol, ncell)
## resolution : 90, 45 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : memory
## names : layer
## values : 1, 16 (min, max)
# is the CRS defined?
myRaster1@crs
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Wait, why is the CRS defined on this new raster? These are the default values
that the raster() function assigns when nothing is defined.
Let's get back to looking at more attributes.
# what is the raster extent?
myRaster1@extent
## class : Extent
## xmin : -180
## xmax : 180
## ymin : -90
## ymax : 90
# plot raster
plot(myRaster1, main="Raster with 16 pixels")
Here we see our raster with the value of 1 to 16 in each pixel.
We can resample the raster as well to adjust the resolution. If we want a higher
resolution raster, we will apply a grid with more pixels within the same extent.
If we want a lower resolution raster, we will apply a grid with fewer pixels
within the same extent.
One way to do this is to create a raster of the resolution you want and then
resample() your original raster. The resampling will use either
nearest neighbor assignment (for categorical data) or bilinear interpolation (for
numerical data).
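Below is one way to create the three additional rasters used in the plots that
follow; the grid dimensions here are assumptions chosen to match the pixel
counts in the plot titles (32, 4, and 2 pixels, all over the same default
global extent):
# create a higher resolution (32 pixel) raster over the same extent
myRaster2 <- raster(nrow=8, ncol=4)
# resample the original raster onto the finer grid
# (bilinear interpolation, since the values are numeric)
myRaster2 <- resample(myRaster1, myRaster2, method='bilinear')
# create two lower resolution rasters the same way
myRaster3 <- raster(nrow=2, ncol=2) # 4 pixels
myRaster3 <- resample(myRaster1, myRaster3, method='bilinear')
myRaster4 <- raster(nrow=1, ncol=2) # 2 pixels
myRaster4 <- resample(myRaster1, myRaster4, method='bilinear')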
To more easily compare them, let's create a graphic layout with 4 rasters in it.
Notice that each raster has the same extent but each a different resolution
because it has a different number of pixels spread out over the same extent.
# change graphical parameter to 2x2 grid
par(mfrow=c(2,2))
# arrange plots in order you wish to see them
plot(myRaster2, main="Raster with 32 pixels")
plot(myRaster1, main="Raster with 16 pixels")
plot(myRaster3, main="Raster with 4 pixels")
plot(myRaster4, main="Raster with 2 pixels")
# change graphical parameter back to 1x1
par(mfrow=c(1,1))
Extent & Coordinate Reference Systems
The X and Y min and max values relate to the coordinate system
that the file is in; see below.
Coordinate Reference System & Projection Information
A spatial reference system (SRS) or coordinate reference system (CRS) is a
coordinate-based local, regional or global system used to locate geographical
entities. -- Wikipedia
The earth is round. This is not a new concept by any means; however, we need to
remember this when we talk about coordinate reference systems associated with
spatial data. When we make maps on paper or on a computer screen, we are moving
from a 3 dimensional space (the globe) to 2 dimensions (our computer screens or
a piece of paper). To keep this short, the projection of a dataset relates to
how the data are "flattened" in geographic space so our human eyes and brains
can make sense of the information in 2 dimensions.
The projection refers to the mathematical calculations performed to "flatten the
data" into 2D space. The coordinate system refers to the x and y coordinate
space that is associated with the projection used to flatten the data. If you
have the same dataset saved in two different projections, these two files won't
line up correctly when rendered together.
Maps of the United States in different projections. Notice the
differences in shape associated with each different projection. These
differences are a direct result of the calculations used to "flatten" the
data onto a 2 dimensional map. Source: M. Corey, opennews.org
Check out this short video,
by Buzzfeed, highlighting how map projections can make continents
seem proportionally larger or smaller than they actually are!
What Makes Spatial Data Line Up On A Map?
There are lots of great resources that describe coordinate reference systems and
projections in greater detail. However, for the purposes of this activity, what
is important to understand is that data from the same location but saved in
different projections will not line up in any GIS or other program. Thus
it's important when working with spatial data in a program like R or Python
to identify the coordinate reference system applied to the data, and to grab
that information and retain it when you process / analyze the data.
The rgdal package has all the common EPSG codes with proj4strings built in. We
can see them by calling the make_EPSG() function and saving the result to an object.
# make sure you loaded rgdal package at the top of your script
# create an object with all EPSG codes
epsg = make_EPSG()
# use View(epsg) to see the full table - doesn't render on website well
#View(epsg)
# View top 5 entries
head(epsg, 5)
## code note prj4
## 1 3819 HD1909 +proj=longlat +ellps=bessel +no_defs +type=crs
## 2 3821 TWD67 +proj=longlat +ellps=aust_SA +no_defs +type=crs
## 3 3822 TWD97 +proj=geocent +ellps=GRS80 +units=m +no_defs +type=crs
## 4 3823 TWD97 +proj=longlat +ellps=GRS80 +no_defs +type=crs
## 5 3824 TWD97 +proj=longlat +ellps=GRS80 +no_defs +type=crs
## prj_method
## 1 (null)
## 2 (null)
## 3 (null)
## 4 (null)
## 5 (null)
Define the extent
In the above raster example, we created several simple raster objects in R.
R defaulted to a global lat/long extent. We can define the exact extent that we
need to use too.
Let's create a new raster with the same projection as our original DEM. We know
that our data are in UTM zone 11N. For the sake of this exercise, let's say we
want to create a raster with the lower left corner coordinate at:
xmin = 254570
ymin = 4107302
The resolution of this new raster will be 1 meter and we will be working
in UTM (meters). First, let's set up the raster.
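One way to set it up is sketched below; the exact creation call is an
assumption chosen to match the attributes printed later (10 rows, 20 columns,
values 1-8, and no CRS):
# create a 10 row x 20 column raster from a matrix; a raster built this way
# has no CRS defined and a placeholder extent, which we will set next
rasterNoProj <- raster(matrix(rep(1:8, length.out=200),
nrow=10, ncol=20, byrow=TRUE))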
Now we can define the new raster's extent by defining the lower left corner of
the raster.
## Define the xmin and ymin (the lower left hand corner of the raster)
# 1. define xMin & yMin objects.
xMin = 254570
yMin = 4107302
# 2. grab the cols and rows for the raster using @ncols and @nrows
rasterNoProj@ncols
## [1] 20
rasterNoProj@nrows
## [1] 10
# 3. raster resolution
res <- 1.0
# 4. multiply the number of cols and rows by the resolution and add to the
# x,y corner location; result = the bounds of our raster extent.
xMax <- xMin + (rasterNoProj@ncols * res)
yMax <- yMin + (rasterNoProj@nrows * res)
# 5. create a raster extent class
rasExt <- extent(xMin,xMax,yMin,yMax)
rasExt
## class : Extent
## xmin : 254570
## xmax : 254590
## ymin : 4107302
## ymax : 4107312
# 6. apply the extent to our raster
rasterNoProj@extent <- rasExt
# Did it work?
rasterNoProj
## class : RasterLayer
## dimensions : 10, 20, 200 (nrow, ncol, ncell)
## resolution : 1, 1 (x, y)
## extent : 254570, 254590, 4107302, 4107312 (xmin, xmax, ymin, ymax)
## crs : NA
## source : memory
## names : layer
## values : 1, 8 (min, max)
# or view extent only
rasterNoProj@extent
## class : Extent
## xmin : 254570
## xmax : 254590
## ymin : 4107302
## ymax : 4107312
Now we have an extent associated with our raster which places it in space!
# plot new raster
plot(rasterNoProj, main="Raster in UTM coordinates, 1 m resolution")
Notice that the coordinates show up on our plot now.
## Challenges: Resample Rasters
Now apply your skills in a new way!
Resample rasterNoProj from 1 meter to 10 meter resolution. Plot it next to the 1 m
resolution raster. Use: par(mfrow=c(1,2)) to create side by side plots.
What happens to the extent if you change the resolution to 1.5 when calculating
the raster's extent properties?
Define Projection of a Raster
We can define the projection of a raster whose correct CRS is known. Sometimes
we download data that have projection information associated with them, but the CRS
is not defined either in the GeoTIFF tags or in the raster itself. If this is the
case, we can simply assign the raster the correct projection.
Be careful doing this - it is not the same thing as reprojecting your data.
Let's define the projection for our newest raster using the DEM raster that
already has a defined CRS. NOTE: in this case we have to know that our raster is
in this projection already, so we don't run the risk of assigning the wrong projection
to the data.
# view CRS from raster of interest
rasterNoProj@crs
## CRS arguments: NA
# view the CRS of our DEM object.
DEM@crs
## CRS arguments:
## +proj=utm +zone=11 +datum=WGS84 +units=m +no_defs
# define the CRS using a CRS of another raster
rasterNoProj@crs <- DEM@crs
# look at the attributes
rasterNoProj
## class : RasterLayer
## dimensions : 10, 20, 200 (nrow, ncol, ncell)
## resolution : 1, 1 (x, y)
## extent : 254570, 254590, 4107302, 4107312 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=11 +datum=WGS84 +units=m +no_defs
## source : memory
## names : layer
## values : 1, 8 (min, max)
# view just the crs
rasterNoProj@crs
## CRS arguments:
## +proj=utm +zone=11 +datum=WGS84 +units=m +no_defs
IMPORTANT: the above code does not reproject the raster. It simply defines the
Coordinate Reference System based upon the CRS of another raster. If you want to
actually change the CRS of a raster, you need to use the projectRaster() function.
### Challenge: Assign CRS
You can set the CRS and extent of a raster using the syntax
rasterWithoutReference@crs <- rasterWithReference@crs and
rasterWithoutReference@extent <- rasterWithReference@extent. Using this information:
open band90.tif in the rasterLayers_tif folder and plot it. (You could consider looking
at it in QGIS first
to compare it to the other rasters.)
Does it line up with our DEM? Look closely at the extent and pixel size. Does anything look off?
Fix what is missing.
(Advanced step) Export a new GeoTIFF. Do things line up in
QGIS?
The code below creates a raster and seeds it with some data. Experiment with the
code.
What happens to the resulting raster's resolution when you change the range
of lat and long values to 5 instead of 10? Try 20, 50 and 100.
What is the relationship between the extent and the raster resolution?
## Challenge Example Code
# set latLong
latLong <- data.frame(longitude=seq( 0,10,1), latitude=seq( 0,10,1))
# make spatial points dataframe, which will have a spatial extent
sp <- SpatialPoints( latLong[ c("longitude" , "latitude") ], proj4string = CRS("+proj=longlat +datum=WGS84") )
# make raster based on the extent of your data
r <- raster(nrow=5, ncol=5, extent( sp ) )
r[] <- 1
r[] <- sample(0:50,25)
r
## class : RasterLayer
## dimensions : 5, 5, 25 (nrow, ncol, ncell)
## resolution : 2, 2 (x, y)
## extent : 0, 10, 0, 10 (xmin, xmax, ymin, ymax)
## crs : NA
## source : memory
## names : layer
## values : 3, 50 (min, max)
Reprojecting Data
If you run into multiple spatial datasets with varying projections, you can
always reproject the data so that they are all in the same projection. Python
and R both have reprojection tools that perform this task.
# reproject raster data from UTM to CRS of Lat/Long WGS84
reprojectedData1 <- projectRaster(rasterNoProj,
crs="+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs ")
# note that the extent has been adjusted to account for the NEW crs
reprojectedData1@crs
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
reprojectedData1@extent
## class : Extent
## xmin : -119.761
## xmax : -119.7607
## ymin : 37.07988
## ymax : 37.08
# note the range of values in the output data
reprojectedData1
## class : RasterLayer
## dimensions : 13, 22, 286 (nrow, ncol, ncell)
## resolution : 1.12e-05, 9e-06 (x, y)
## extent : -119.761, -119.7607, 37.07988, 37.08 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : memory
## names : layer
## values : 0.64765, 8.641957 (min, max)
# use nearest neighbor interpolation method to ensure that the values stay the same
reprojectedData2 <- projectRaster(rasterNoProj,
crs="+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs ",
method = "ngb")
# note that the min and max values have now been forced to stay within the same range.
reprojectedData2
## class : RasterLayer
## dimensions : 13, 22, 286 (nrow, ncol, ncell)
## resolution : 1.12e-05, 9e-06 (x, y)
## extent : -119.761, -119.7607, 37.07988, 37.08 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : memory
## names : layer
## values : 1, 8 (min, max)
Want to use plot centroid values (marking the center of a plot) in x,y format
to get the plot boundaries of a certain size around the centroid? This tutorial
is for you!
If the plot is a circle, we can generate the plot boundary using a buffer
function in R or a GIS program. However, creating a square
boundary around a centroid requires an alternate approach. This tutorial
presents a way to create square polygons of a given radius (referring to half
of the plot's width) for each plot centroid location in a dataset.
Special thanks to
jbaums
from StackOverflow for helping with the SpatialPolygons code!
Learning Objectives
After completing this activity, you will be able to:
Create square polygons around a centroid point.
Export shapefiles from R using the writeOGR() function.
Things You’ll Need To Complete This Tutorial
You will need the most current version of R and, preferably, RStudio loaded
on your computer to complete this tutorial.
This data download contains several files. You will only need the SJERPlotCentroids.csv
file for this tutorial. The path to this file is: NEON-DS-Field-Site-Spatial-Data/SJER/PlotCentroids/SJERPlotCentroids.csv .
The other data files in the downloaded data directory are used for related tutorials.
You should set your working directory to the parent directory of the downloaded
data to follow the code exactly.
Set Working Directory: This lesson assumes that you have set your working
directory to the location of the downloaded and unzipped data subsets.
R Script & Challenge Code: NEON data lessons often contain challenges that reinforce
learned skills. If available, the code for challenge solutions is found in the
downloadable R script of the entire lesson, available in the footer of each lesson page.
Our x,y coordinate centroids come in a ".csv" (Comma Separated Value) file with
the plot ID that goes with the data. The data we are using today were collected
at the National Ecological Observatory Network field site at the
San Joaquin Experimental Range (SJER) in California.
Load .csv, Setup Plots
To work with our spatial data in R, we can use the rgdal package and the
sp package. Once we've loaded these packages and set the working directory to
where our .csv file with the data is located, we can load our data.
# load the sp and rgdal packages
library(sp)
library(rgdal)
# set working directory to data folder
#setwd("pathToDirHere")
wd <- ("~/Git/data/")
setwd(wd)
# read in the NEON plot centroid data
# `stringsAsFactors=F` ensures character strings don't import as factors
centroids <- read.csv(paste0(wd,"NEON-DS-Field-Site-Spatial-Data/SJER/PlotCentroids/SJERPlotCentroids.csv"), stringsAsFactors=FALSE)
Let's look at our data. This can be done several ways but one way is to view
the structure (str()) of the data.
# view data structure
str(centroids)
## 'data.frame': 18 obs. of 5 variables:
## $ Plot_ID : chr "SJER1068" "SJER112" "SJER116" "SJER117" ...
## $ Point : chr "center" "center" "center" "center" ...
## $ northing: num 4111568 4111299 4110820 4108752 4110476 ...
## $ easting : num 255852 257407 256839 256177 255968 ...
## $ Remarks : logi NA NA NA NA NA NA ...
We can see that our data consists of five distinct types of data:
Plot_ID: denotes the plot
Point: denotes where the point is taken -- all are centroids
northing: northing coordinate for point
easting: easting coordinate for point
Remarks: any other remarks from those collecting the data
It would be nice to have a metadata file with this .csv to confirm the coordinate
reference system (CRS) that the points are in. However, without one, based on
the numbers, we can assume the data are in Universal Transverse Mercator (UTM). And
since we know the data are from the San Joaquin Experimental Range, that puts them in
UTM zone 11N.
Part 1: Create Plot Boundary
Now that we understand our centroid data file, we need to set how large our plots
are going to be. The next piece of code sets the "radius" for the plots.
This radius will later be used to calculate vertex locations that define the plot
perimeter.
In this case, let's use a radius of 20m. This means that the edge of each plot
(not the corner) is 20m from the centroid. Overall this will create a 40 m x 40 m
square plot.
Units: Radius is in meters, matching the UTM CRS. If your coordinates were in
lat/long or some other CRS, then you'd need to modify the code.
Plot Orientation: Our code is based on simple geometry and assumes that plots
are oriented North-South. If you wanted a different orientation,
adjust the math accordingly to find the corners.
# set the radius for the plots
radius <- 20 # radius in meters
# define the plot edges based upon the plot radius.
yPlus <- centroids$northing+radius
xPlus <- centroids$easting+radius
yMinus <- centroids$northing-radius
xMinus <- centroids$easting-radius
When combining the coordinates for the vertices, it is important to close the
polygon. This means that a square will have 5 instead of 4 vertices. The fifth
vertex is identical to the first vertex. Thus, by repeating the first vertex
coordinate (xMinus,yPlus) the polygon will be closed.
The cbind() function allows us to combine or bind together data by column. Make
sure to create the vertices in an order that makes sense. We recommend starting
at one corner (here, the NW) and proceeding clockwise.
# calculate polygon coordinates for each plot centroid.
square=cbind(xMinus,yPlus, # NW corner
xPlus, yPlus, # NE corner
xPlus,yMinus, # SE corner
xMinus,yMinus, # SW corner
xMinus,yPlus) # NW corner again - close polygon
Next, we will associate the centroid plot ID, from the .csv file, with the plot
perimeter polygon that we create below. First, we extract the Plot_ID from our
data. Note that because we set stringsAsFactors to FALSE when importing, we can
extract the Plot_IDs using the code below. If we hadn't done that, our IDs would
have come in as factors and we'd have to use the code
ID=as.character(centroids$Plot_ID).
# Extract the plot ID information
ID=centroids$Plot_ID
We are now left with two key "groups" of data:
a matrix, square, which has the corner points for our new 40x40m plots
a vector, ID, with the Plot_IDs for each new 40x40m plot
If all we wanted to do was get these points, we'd be done. But no, we want to
be able to create maps with our new plots as polygons and have them as spatial
data objects for later analyses.
Part 2: Create Spatial Polygons
Now we need to convert our matrix square into a SpatialPolygons object. This
particular step is somewhat confusing. Please consider reading up on the
SpatialPolygon object in R
in
the sp package documentation (pg 86)
or check out this
StackOverflow thread.
Two general considerations:
First, spatial polygons require a list of lists. Each list contains the xy
coordinates of each vertex in the polygon - in order. It is always important
to include the closing vertex as we discussed above -- you'll have to repeat the
first vertex coordinate.
Second, we need to specify the CRS string for our new polygon. We will do this
with a proj4string. We can either type in the proj4string (as we do below) or
we can grab the string from another file that has CRS information.
To do this, we'd use the syntax:
proj4string =CRS(as.character(FILE-NAME@crs))
For example, if we imported a GeoTIFF file called "canopy" that was in a
UTM coordinate system, we could type proj4string = CRS(as.character(canopy@crs)).
Method 1: mapply function
We'll do this in two different ways. The first, using the mapply() function,
is far more efficient. However, the function hides a bit of what is going on, so
afterwards we'll show how it is done without the function so you understand it.
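One implementation of this approach (a sketch adapted from the jbaums
StackOverflow answer credited above) builds one Polygons object per row of
square and wraps them all in a SpatialPolygons object:
# make SpatialPolygons from the square coordinates, one polygon per plot ID
polys <- SpatialPolygons(
  mapply(function(poly, id) {
    # reshape one row of square into a 5x2 matrix of vertex coordinates
    xy <- matrix(poly, ncol=2, byrow=TRUE)
    # create a Polygons object with the matching plot ID
    Polygons(list(Polygon(xy)), ID=id)
  },
  split(square, row(square)), ID),
  proj4string=CRS(as.character("+proj=utm +zone=11 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0")))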
Let's create a simple plot to see our new SpatialPolygon data.
# plot the new polygons
plot(polys)
Yay! We created polygons for all of our plots!
Method 2: Using loops
Let's do the process again with simpler R code so that we understand how the
process works. Keep in mind that loops are less efficient for processing your data
but don't hide as much under the hood. Once you understand how this works, we
recommend the mapply() function for your actual data processing.
# First, initialize a list that will later be populated
# 'a' is just a placeholder name, since this is temporary
a <- vector('list', nrow(centroids))
# loop through each centroid value and create a polygon
# this is where we match the ID to the new plot coordinates
for (i in 1:nrow(centroids)) { # for each row in object centroids
a[[i]] <- Polygons(list(Polygon(matrix(square[i, ], ncol=2, byrow=TRUE))), ID[i])
# make it a Polygons object with the Plot_ID from object ID
}
# convert a to SpatialPolygon and assign CRS
polysB<-SpatialPolygons(a,proj4string=CRS(as.character("+proj=utm +zone=11 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0")))
Let's see if it worked with another simple plot.
# plot the new polygons
plot(polysB)
Good. The two methods return the same plots. We now have our new plots saved as
a SpatialPolygon but how do we share that with our colleagues? One way is to turn
them into shapefiles, which can be read into R, Python, QGIS, ArcGIS, and many
other programs.
Part 3: Export to Shapefile
Before you can export a shapefile, you need to convert the SpatialPolygons to a
SpatialPolygonsDataFrame. Note that in this step you could add additional
attribute data if you wanted to!
# Create SpatialPolygonsDataFrame -- this step is required to output multiple polygons.
polys.df <- SpatialPolygonsDataFrame(polys, data.frame(id=ID, row.names=ID))
Let's check out the results before we export. And we can add color this time.
plot(polys.df, col=rainbow(50, alpha=0.5))
When we want to export a spatial object from R as a shapefile, writeOGR() is a
nice function. It writes not only the shapefile, but also the associated
Coordinate Reference System (CRS) information as long as it is associated with
the spatial object (e.g., if it was identified when creating the SpatialPolygons
object).
To do this we need the following arguments:
1. the name of the spatial object (polys.df)
2. the file path from the current working directory to the directory where we want
to save our shapefile (if we want it in our current directory, we can simply use '.')
3. the name of the new shapefile (2014Plots_SJER)
4. the driver which specifies the file format (ESRI Shapefile)
We can now export the spatial object as a shapefile.
# write the shapefiles
writeOGR(polys.df, '.', '2014Plots_SJER', 'ESRI Shapefile')
And there you have it -- a shapefile with a square plot boundary around your
centroids. Bring this shapefile into QGIS or whatever GIS package you prefer
and have a look!
Packages are collections of R functions, data, and compiled code in a
well-defined format. When you install a package it gives you access to a set
of commands that are not available in the base R set of functions. The directory
where packages are stored is called the library. R comes with a standard set of
packages. Others are available for download and installation. Once installed,
they have to be loaded into the session to be used.
Installing Packages in R
To install a package you have to know where to get the package. Most established
packages are available from
"CRAN" or the Comprehensive
R Archive Network.
Packages download from specific CRAN "mirrors" where the packages are saved
(assuming that a binary, or set of installation files, is available for your
operating system). If you have not set a preferred CRAN mirror in your
options(), then a menu will pop up asking you to choose a location from which
you'd like to install your packages.
To install any package from CRAN, you use install.packages(). You only need to
install packages the first time you use R (or after updating to a new version).
# install the ggplot2 package
install.packages("ggplot2")
**R Tip:** You can just type this into the command
line of R to install each package. Once a package is installed, you don't have
to install it again while using that version of R!
Use a Package
Once a package is installed (basically the functions are downloaded to your
computer), you need to "call" the package into the current session of R. This
is essentially like saying, "Hey R, I will be using these functions now, please
have them ready to go". You have to do this every time you start a new R session,
so this should be at the top of your script.
When you want to call a package, use library(PackageNameHere). You may also
see some people using require() -- while that works in most cases, it does
function slightly differently and best practice is to use library().
# load the package
library(ggplot2)
What Packages are Installed Now?
If you want to use a package, but aren't sure if you've installed it before,
you can check! In code, you can use installed.packages().
# check installed packages
installed.packages()
If you are using RStudio, you can also check out the Packages tab. It will list
all the currently installed packages and have a check mark next to them if they
are currently loaded and ready to use. You can also update and install packages
from this tab. While you can "call" a package from here too by checking the box,
I wouldn't recommend it, as that call won't be in your script, and
if you run the script again this could trip you up!
Updating Packages
Sometimes packages are updated by the developers who created them. Updating packages
can change both how the package works and how your code runs.
** If you already have a lot of code using a package, be cautious about updating
packages as some functionality may change or disappear.**
Otherwise, go ahead and update old packages so things are up to date.
In code, you can use old.packages() to check to see what packages are out of
date.
update.packages() will update all packages in the known libraries
interactively. This can take a while if you haven't done it recently! To update
everything without any user intervention, use the ask = FALSE argument.
If you only want to update a single package, the best way to do it is using
install.packages() again.
# list all packages where an update is available
old.packages()
# update all available packages
update.packages()
# update, without prompts for permission/clarification
update.packages(ask = FALSE)
# to update only a specific package, use install.packages()
install.packages("plotly")
In RStudio, you can also
manage packages using Tools -> Install Packages.
Challenge: Installing Packages
Check to see if you can install the dplyr package or a package of interest to
you.
Check to see if the dplyr package is installed on your computer.
If it is not installed, install the "dplyr" package in R.
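One way to approach this (a sketch using base R functions):
# check whether dplyr is already installed
"dplyr" %in% rownames(installed.packages())
# install it only if it is missing
if (!requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}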
Each point in a LiDAR dataset has an X, Y, Z value and other attributes. The
points may be located anywhere in space and are not aligned within any particular
grid.
Representative point cloud data. Source: National Ecological
Observatory Network (NEON)
LiDAR point clouds are typically available in a .las file format. The .las file
format is a binary format designed to handle the millions of points that
are often associated with LiDAR data point clouds (its compressed counterpart is .laz).
Common LiDAR Data Products
The Digital Terrain Model (DTM) product represents the elevation of the ground, while
the Digital Surface Model (DSM) product represents the elevation of the tallest
surfaces at that point. Imagine draping
a sheet over the canopy of a forest: the DSM contours with
the heights of the trees where there are trees, but drops to the elevation of the ground
where there is a clearing in the forest.
DSM and DTM Visualizations
The Canopy Height Model (CHM) represents the difference between the Digital Surface Model and the
Digital Terrain Model (DSM - DTM = CHM) and gives you the height of the objects (in a
forest, the trees) that are on the surface of the earth.
3D models derived from LiDAR Data. Left: Digital Terrain Model (DTM), Middle: Digital Surface Model (DSM), Right: Canopy Height Model (CHM). Source: National Ecological Observatory Network (NEON)
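In code, a CHM is simple raster subtraction. A minimal sketch, assuming you
have DSM and DTM rasters loaded (hypothetical objects here) with matching
extent and resolution:
# canopy height = surface height - ground height
CHM <- DSM - DTM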
Gridded, or Raster, LiDAR Data Products
LiDAR data products are most often worked within a gridded or raster data format.
A raster file is a regular grid of cells, all of which are the same size.
A few notes about rasters:
Each cell is called a pixel.
And each pixel represents an area on the ground.
The resolution of the raster represents the area that each pixel covers
on the ground. So, for instance, if the raster is 1 m resolution, that simply
means that each pixel represents a 1 m by 1 m area on the ground.
Raster or “gridded” data are stored as a grid of values which are rendered on a map as pixels. Each pixel value represents an area on the
Earth’s surface. Source: National Ecological Observatory Network (NEON)
Raster data can have attributes associated with them as well. For instance in a
LiDAR-derived digital elevation model (DEM), each cell might represent a
particular elevation value. In a LIDAR-derived intensity image, each cell
represents a LIDAR intensity value.
LiDAR Related Metadata
In short, when you go to download LiDAR data the first question you should ask
is what format the data are in. Are you downloading point clouds that you might
have to process? Or rasters that are already processed for you? How do you know?
Check out the metadata!
Look at the file format - if you are downloading a .las file, then you are
getting points. If it is .tif, then it is a post-processed raster file.
Create Useful Data Products from LiDAR Data
Classify LiDAR Point Clouds
LiDAR data points are vector data. LiDAR point clouds are useful because they
tell us something about the heights of objects on the ground. However, how do
we know whether a point reflected off of a tree, a bird, a building or the
ground? In order to develop products like elevation models and canopy height
models, we need to classify individual LiDAR points. We might classify LiDAR
points into classes including:
Ground
Vegetation
Buildings
LiDAR point cloud classification is often already done when you download LiDAR
point clouds but just know that it’s not to be taken for granted! Programs such
as lastools, fusion and terrascan are often used to perform this classification.
Once the points are classified, they can be used to derive various LiDAR data
products.
Create A Raster From LiDAR Point Clouds
There are different ways to create a raster from LiDAR point clouds.
Point to Raster Methods - Basic Gridding
Let's look at one of the most basic ways to create a raster from points - basic gridding.
When you perform a gridding algorithm, you are simply calculating a value, using
the point data, for each pixel in your raster dataset.
To begin, a grid is placed on top of the LiDAR data in space. Each cell in
the grid has the same spatial dimensions. These dimensions represent that
particular area on the ground. If we want to derive a 1 m resolution raster
from the LiDAR data, we overlay a 1m by 1m grid over the LiDAR data points.
Within each 1m x 1m cell, we calculate a value to be applied to that cell,
using the LiDAR points found within that cell. The simplest method of doing this
is to take the max, min or mean height value of all lidar points found within
the 1m cell. If we use this approach, we might have cells in the raster that
don't contain any lidar points. These cells will have a "no data" value if we
process our raster in this way.
Animation showing the general process of taking LiDAR point clouds and converting them to a raster format.
Source: Tristan Goulden, National Ecological Observatory Network (NEON)
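A minimal sketch of this basic gridding approach, using rasterize() from the
raster package with made-up point data:
# hypothetical lidar points: x, y locations and a height value (z)
pts <- data.frame(x = runif(1000, 0, 100),
y = runif(1000, 0, 100),
z = runif(1000, 0, 30))
# a 1 m resolution grid covering the same area as the points
grid <- raster(xmn=0, xmx=100, ymn=0, ymx=100, res=1)
# assign each cell the max height of the points that fall within it;
# cells containing no points are given NA ("no data")
heightGrid <- rasterize(pts[, c("x", "y")], grid, field=pts$z, fun=max)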
Point to Raster Methods - Interpolation
A different approach is to interpolate the value for each cell.
In this approach we still start with placing the grid on top of the LiDAR
data in space.
Interpolation considers the values of points outside of the cell, in addition
to points within the cell, to calculate a value. Interpolation is useful because
it can provide us with some ability to predict or calculate cell values in areas
where there are no data (or no points), and to quantify the error associated with
those predictions, which is useful to know if you are doing research.
LiDAR data collected over Grand Mesa, Colorado as a part of instrument testing and calibration by the National Ecological Observatory Network 's Airborne Observation Platform (NEON AOP). Source: National Ecological Observatory Network (NEON)
LiDAR File Formats
LiDAR data are most often available as discrete points. Remember, though, that these data can be collected by the lidar instrument in either discrete or full waveform formats. A collection of discrete return LiDAR points is known as a LiDAR point cloud.
In this tutorial, you will open a .las file, in the plas.io free online lidar data viewer. You will then explore some of the attributes associated with a lidar data point cloud.
LiDAR Attribute Data
Remember that not all lidar data are created equally. Different lidar data may have different attributes. In this tutorial, we will look at data that contain both intensity values and a ground vs non ground classification.
Plasio is a project by Uday Verma and Howard Butler that implements point cloud
rendering capability in a browser. Specifically, it provides a functional
implementation of the ASPRS LAS format, and it can consume LASzip-compressed
data using the LASzip NaCl module. Plasio is Chrome-only at this time, but it is
hoped that other contributors can step forward to bring it to other browsers.
It is expected that most WebGL-capable browsers should be able to support
plasio, and it contains nothing that is explicitly Chrome-specific beyond the
optional NaCL LASzip module.
This tool is useful because you don't need to install anything to use it! Drag
and drop your lidar data directly into the tool and begin to play! The website
also provides access to some prepackaged datasets if you want to experiment on
your own.
Enough reading, let's open some NEON LiDAR data!
1. Open a .las file in plas.io
Download the NEON prepackaged lidar dataset (above in Download the Data)
if you haven't already.
The file is named: NEON-DS-Sample-LiDAR-Point-Cloud.las
When the download is complete, drag the file NEON-DS-Sample-LiDAR-Point-Cloud.las
into the plas.io website window.
Zoom and pan around the data
Use the particle size slider to adjust the size of each individual lidar point.
NOTE: the particle size slider is located a little more than half way down the
plas.io toolbar in the "Data" section.
NICE! You should see something similar to the screenshot below:
NEON lidar data in the plas.io online tool.
Navigation in Plas.io
You might prefer to use a mouse to explore your data in plas.io. Let's test the navigation out.
Left click on the screen and drag the data. Notice that this tilts the data up and down.
Right click on the screen and drag, noticing that this moves the entire dataset around.
Use the scroll wheel on your mouse to zoom in and out.
How The Points are Colored
Why is everything grey when the data are loaded?
Notice that the data, upon initial view, are colored in a black-to-white color
scheme. These colors represent the data's intensity values. Remember that the
intensity value, for each LiDAR point, represents the amount of light energy
that reflected off of an object and returned to the sensor. In this case, darker
colors represent LESS light energy returned. Lighter colors represent MORE light
returned.
Lidar intensity values represent the amount of light energy that
reflected off of an object and returned to the sensor.
2. Adjust the intensity threshold
Next, scroll down through the tools in plas.io. Look for the Intensity Scaling
slider. The intensity scaling slider allows you to define the thresholds of
light to dark intensity values displayed in the image (similar to stretching
values in an image processing software or even in Photoshop).
Drag the slider back and forth. Notice that you can brighten up the data using the slider.
The intensity scaling slider is located below the color map
tool so it's easy to miss. Drag the slider back and forth to adjust the range
of intensity values and to brighten up the lidar point clouds.
3. Change the lidar point cloud color options to Classification
In addition to intensity values, these lidar data also have a classification
value. Lidar data classification values are numeric, ranging from 0-20 or
higher. Some common classes include:
0 Not classified
1 Unassigned
2 Ground
3 Low vegetation
4 Medium vegetation
5 High Vegetation
6 Building
Blue and Orange gradient color scheme submitted by Kendra Sand.
What color scheme is your favorite?
In this case, these data are classified as either ground or non-ground. To view the points, colored by class:
Change the "colorization" setting to "Classification".
Change the intensity blending slider to "All Color".
For kicks - play with the various colormap options to change the colors of the points.
Set the colorization to 'classified' and then adjust the intensity blending to view the points, colored by ground and non-ground classification.
4. Spend Some Time Exploring - Do you See Any Trees?
Finally, spend some time exploring the data. What features do you see in this dataset? What does the topography look like? Is the site flat? Hilly? Mountainous? What do the lidar data tell you, just upon initial inspection?
Summary
The plas.io online point cloud viewer allows you to quickly view and explore lidar data point clouds.
Each lidar data point will have an associated set of attributes. You can check the metadata to determine which attributes the dataset contains. NEON data, provided above, contain both classification and intensity values.
Classification values represent the type of object that the light energy reflected off of. Classification values are often ground vs non ground. Some lidar data files might have buildings, water bodies and other natural and man-made elements classified.
LiDAR data often have an intensity value associated with them. This represents the amount of light energy that reflected off an object and returned to the sensor.