Collaborating with OTN - Advantages and Best Practices
Overview
Teaching: 30 min
Exercises: 0 min
Questions
What is the Ocean Tracking Network?
How do I take my records from field notes to analysis-ready data sets?
Where, when, and how do I submit data and metadata to OTN?
Objectives
Learn how OTN quality-controls and aggregates my data
Understand the benefits of sharing data, and how OTN can help
This presentation will describe how you can interact with the Ocean Tracking Network to extend your study across all the compatible telemetry activity globally, and take advantage of quality control methods and standardized formats in use by hundreds of OTN partners around the world.
Key Points
OTN is here to help you quality-control and match your telemetry data with all the other projects across their global network
Base R functions vs. Tidyverse
Overview
Teaching: 15 min
Exercises: 0 min
Questions
How can I introspect, subset, and plot my data using base R?
How do I reformat dates stored as ‘strings’ into date objects?
What are the Tidyverse functions that will do the same tasks?
Objectives
Learn how to perform basic R data manipulation tasks in base R, and using the Tidyverse
The Tidyverse and its component packages are reimagining how you work with tabular data in R. It includes component libraries and functions for organizing code for readability, cleaning data, and plotting it. While tidyverse is only one of a few competing paradigms for performing these jobs in R, here we will show a few examples of performing tasks in base R and again using Tidy methods.
First, some reminders on how to quickly get help with R functions.
Getting help with Functions
# Getting Help with Functions/Packages ####
?barplot
args(lm)
?lm
The Tidyverse
Intro - looking at data
OK, let’s load the Tidyverse libraries, and then load and explore a working dataset that is included in the example repository.
# Tidyverse - provides dplyr, ggplot2, and many other packages that simplify data.frame manipulation in R
library(tidyverse)
# Data and code helpfully provided by Dr. Robert Lennox of NORCE
load("seaTrout.rda") # load 1.5m sea trout detections from a dataset in the Norwegian fjords.
# Lots of ways to get a clue about what this new data var is and how it looks. The general format of our exploration will be:
# Base R way:
str(seaTrout)
# Tidyverse option:
glimpse(seaTrout)
You can also call a function to look at just the first few rows: head()
# Base R way:
head(seaTrout)
# alternate way of calling the same head() function using Tidyverse's magrittr:
seaTrout %>% head
Challenge: what are some useful arguments you can pass to the head function?
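One possible answer, sketched briefly: the n argument controls how many rows head() returns, and a negative n drops that many rows from the end instead.
# n controls how many rows head() returns
head(seaTrout, n = 10)  # first 10 rows
head(seaTrout, n = -5)  # everything except the last 5 rows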
OK, let’s do some more subsetting of our data. What if we want one column? How would we return the first 5 rows?
# Select one column by number, base and then Tidy
seaTrout[,c(6)]
seaTrout %>% select(6)
# Select the first 5 rows.
seaTrout[c(1:5),]
seaTrout %>% slice(1:5)
Question: what’s this c() business all about?
Summarizing data - finding unique values:
OK, something a little more practical, how would we determine what values are in our Species column? First base, then tidy.
# basic functions 1.2: how many species do we have?
nrow(data.frame(unique(seaTrout$Species)))
seaTrout %>% distinct(Species) %>% nrow
Dealing with Dates
While a whole module could be written on this topic, here we’ll show just the difference between formatting dates using base R and using lubridate, a wonderful library for managing date formats. This is also the first place we’ll start to see some minor speed differences in using Tidy vs. base functions.
# basic function 1.3: format date times
as.POSIXct(seaTrout$DateTime) # Check if your beverage needs refilling
seaTrout %>% mutate(DateTime=ymd_hms(DateTime)) # Lightning-fast, thanks to Lubridate's ymd_hms(), 'yammed hams'!
# For American-style date strings (month first), try mdy_hms()
Filtering data
We’ll often want to subset data to evaluate what’s happening within a subgroup. Here, let’s look at selecting out all the Trout in our data vs. the salmon. Some small speed difference here if you can detect it.
# basic function 1.4: filtering
seaTrout[which(seaTrout$Species=="Trout"),]
seaTrout %>% filter(Species=="Trout")
Plotting
We’ll do some more complex stuff with ggplot in a minute, but if you have two axes and want to see them on a plot, here are the base R and ggplot ways of getting there quickly.
# basic function 1.5: plotting
plot(seaTrout$lon, seaTrout$lat) # check again if beverage is full
seaTrout %>% ggplot(aes(lon, lat))+geom_point() # sometimes plots just take time
Summarizing Data - applying functions to get summaries
We’ll do some more complex stuff with ggplot in a minute, but if you have two axes and want to see them on a plot, here are the base R and ggplot ways of getting there quickly.
# basic function 1.6: getting data summaries
tapply(seaTrout$lon, seaTrout$tag.ID, mean) # get the mean value across a column
seaTrout %>%
group_by(tag.ID) %>%
summarise(mean=mean(lon)) # Can use pipes and newlines to organize this process
Questions:
- What arguments does tapply() take?
- What method of summarizing do you find more readable?
Key Points
Many of the tasks you will need to do to analyze your data are made easier or more readable via Tidyverse methods.
Making basic plots using ggplot
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How do I make more sophisticated plots in ggplot?
Objectives
Become familiar with ggplot and dplyr’s methods to summarize and plot data
Explore how ggplot aesthetics and geometry work together
ggplot2 takes advantage of tidyverse pipes and chains of data manipulation, as well as separating the aesthetics of the plot (what are we plotting?) from the styling of the plot (how should we show it?), in order to produce readable and malleable plotting code.
Let’s start with a practical example. First we’ll plot the basic shape of this data summary, then we’ll look at applying more style choices to improve our plot’s readability.
# monthly longitudinal distribution of salmon smolts and sea trout
seaTrout %>%
group_by(m=month(DateTime), tag.ID, Species) %>%
summarise(mean=mean(lon)) %>%
ggplot(aes(m %>% factor, mean, colour=Species, fill=Species))+ # the data is supplied, but no info on how to show it!
geom_point(size=3, position="jitter")+ # draw data as points, and use jitter to help see all points instead of superimposition
coord_flip()+
scale_colour_manual(values=c("grey", "gold"))+ # change the color palette to reflect species a bit better
scale_fill_manual(values=c("grey", "gold"))+
geom_boxplot()+
geom_violin(colour="black")
After we apply all the styling, our grouped time factor’s on the Y axis to highlight the longitudinal change that we’re showing on the X axis, and we’re seeing box plots and violins on top of the ‘raw’ data points to provide additional context. We’ve also made a few style choices to ensure we can tease apart all these overlapping plots a bit better.
There are other ways to present a summary of data like this that we might have chosen. geom_density2d() will give us a KDE for our data points and some contours across our chosen plot axes.
seaTrout %>%
group_by(m=month(DateTime), tag.ID, Species) %>%
summarise(mean=mean(lon)) %>%
ggplot(aes(m, mean, colour=Species, fill=Species))+
geom_point(size=3, position="jitter")+
coord_flip()+
scale_colour_manual(values=c("grey", "gold"))+
scale_fill_manual(values=c("grey", "gold"))+
geom_density2d(size=2, lty=1)
Here we start to potentially see why we might like to use multiple plots for each subset, or facets, for our two distinct species, as they’re hard to see on top of one another in this way. Switching to stat_density_2d will fill in my levels (and obliterate my ability to see the underlying data points). I’m also going to use labs() to properly label my axes.
seaTrout %>%
group_by(m=month(DateTime), tag.ID, Species) %>%
summarise(mean=mean(lon)) %>%
ggplot(aes(m, mean))+
stat_density_2d(aes(fill = stat(nlevel)), geom = "polygon")+
#geom_point(size=3, position="jitter")+
coord_flip()+
facet_wrap(~Species)+
scale_fill_viridis_c() +
labs(x="Mean Month", y="Longitude (UTM 33)")
Facets are a great way to highlight differences across your groups, and the most obvious next choice for a grouping is by individual tagged animal. Be aware of how many plots you are going to end up with!
# per-individual density contours - lots of facets!
seaTrout %>%
ggplot(aes(lon, lat))+
stat_density_2d(aes(fill = stat(nlevel)), geom = "polygon")+
facet_wrap(~tag.ID)
So reminder, this is all just exploratory work so far. Using the big individuals plot here, we could identify interesting individuals to subset away for further exploration, or pick out the potential non-survivors and subset them away from the pack. But we’ve worked mostly with summaries of movement data so far, taking advantage of what we know about our domain without actually looking at it yet. Next we’ll do something a bit more spatially-aware.
Key Points
You can feed output from dplyr’s data manipulation functions into ggplot using pipes.
Plotting various summaries and groupings of your data is good practice at the exploratory phase, and dplyr and ggplot make iterating different ideas straightforward.
Loading, formatting, and plotting spatial data objects
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How do I plot my data on a map if my coordinates aren’t in latitude and longitude?
Where can I find environmental data for my study area (like bathymetry)?
Objectives
Learn how to register the spatial component of your dataset, creating a spatial object
Learn how to re-project spatial data objects into different coordinate systems
You may have noticed that our latitude and longitude columns were in a surprising unit. UTM coordinate systems have their advantages, but some desirable operations, like plotting alongside rasters of bathymetric data, or alongside any other dataset with lat/lon coordinates, require conversion to ‘true’ latitude and longitude values. So let’s convert our dataset into a spatial data frame with decimal-degree latitude and longitude as its coordinate system.
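The spatial steps below lean on the sp package (coordinates(), proj4string(), CRS(), spTransform()) and, a little later, on marmap for the bathymetry download; depending on your setup, rgdal supplies the projection machinery behind spTransform(). If these aren’t loaded yet, a minimal setup might look like this (assuming the packages are installed):
library(sp)     # spatial classes: coordinates(), CRS(), spTransform()
library(rgdal)  # projection support used by spTransform() on sp objects
library(marmap) # getNOAA.bathy(), scale_fill_etopo(), autoplot() for bathy objects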
# Say that stS is a copy of our seaTrout data
stS<-seaTrout
coordinates(stS)<-~lon+lat # tell stS that it's a spatial data frame with coordinates in lon and lat
proj4string(stS)<-CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") # in reference system UTM 33
We’d like to project this into longlat - but the spTransform() can take this to any projection / coordinate reference system we might need.
# Project it to a different, strange CRS.
st<-spTransform(stS, CRS("+init=epsg:28992")) # what is epsg:28992 - https://epsg.io/28992 - netherlands/dutch topographic map in easting/northing
st<-data.frame(st) # results aren't a data.frame by default though.
View(st) # It knows the coords should go in lon and lat columns, but they're actually still easting/northing as per the CRS we chose.
# Project it to the CRS we want it to be:
st<-spTransform(stS, CRS("+proj=longlat +datum=WGS84")) # ok let's put it in lat/lon WGS84
st<-data.frame(st)
View(st)
So now we have a spatially-aware data frame, using lat and lon as its coordinates in the appropriate CRS. As we want to see the underlying study area, we can use the min/max values of our latitude and longitude columns to subset a world-scale dataset, or to fetch a terrain/bathymetric map from a public data source. We’ll do this for the NOAA-provided ETOPO1 bathymetric data, using functions provided by a library called marmap.
x=.5 # padding for our bounds
st <- st %>% as_tibble() # put our data back into tibble form explicitly
bgo <- getNOAA.bathy(lon1 = min(st$lon-x), lon2 = max(st$lon+x),
lat1 = min(st$lat-x), lat2 = max(st$lat+x), resolution = 1) # higher resolutions are very big.
# now we have bgo, let's explore it.
class(bgo); bgo %>% class # same %>% concept as before, these are just two ways of doing the same thing
# let's see what it looks like:
plot(bgo) # what's a 'bathy' object?
plot(bgo, col="royalblue")
autoplot(bgo)
As a housekeeping step, we’ll often use fortify to first turn the raster into an x,y,z data.frame, and then optionally as_tibble to make it explicitly a tibble. This makes our plotting functions happy, and works for any raster file! Here’s the syntax to turn bgo into an x,y,z data.frame and then into a tibble.
bgo %>% fortify %>% as_tibble
A few helper functions that we’ll use for our bathymetry map can be plussed into the ggplot call. scale_fill_etopo() will give us a sensible colorbar for our bathymetric and topographic data, and ggplot provides a few theme_xxxx() functions for pre-setting a bunch of style options.
# Let's really lay on the style with ggplot
bgo %>%
fortify %>%
ggplot(aes(x, y, fill=z))+
geom_raster() + # raster-fy - plot to here first, see the values on a blue-scale. This isn't the fjords of Norway....
scale_fill_etopo() + # color in the pixels with marmap's scale_fill_etopo to see the positive values more clearly.
labs(x="Longitude", y="Latitude", fill="Depth")+
theme_classic()+
theme(legend.position="top")+
theme(legend.key.width = unit(5, "cm"))
You can save the output of your ggplot chain to a variable instead of displaying it immediately. We’ll come back to this bathymetric map of our domain later on and lay some more stuff on top of it, so let’s save it to bplot to make that easy on ourselves.
We’re also going to plot our data points on here, just the distinct lat and lon values to avoid overplotting, but now we’ll know where all our listening stations are. A point of order here. Our bathymetry has a z value in its aesthetic, but our sub-plotted receiver points don’t have that! To avoid ggplot chastening us for this ‘missing’ positional data, we tell it we’re being different with the aesthetics using inherit.aes=F
in our call to geom_point()
.
bplot<-bgo %>% # Save the output of the plot call to a variable instead of showing it.
fortify %>% # Turn a raster into a data.frame
ggplot(aes(x, y, z=z, fill=z))+
geom_raster()+ # raster-ify
scale_fill_etopo()+ # color in the pixels
labs(x="Longitude", y="Latitude", fill="Depth")+
theme_classic()+
theme(legend.position="top") +
geom_point(data=st %>%
as_tibble() %>%
distinct(lon, lat),
aes(lon, lat), inherit.aes=F, pch=21, fill="red", size=2)+ # don't inherit aesthetics from the parent,
# original map is lat/lon/depth, points without Z vals will fail to plot!
theme(legend.key.width = unit(5, "cm"))
bplot # now that it's in bplot, here's how you can show it if you need to.
Hang onto bplot in your environment for now. We’ll use it later on!
Key Points
More Tidyverse functions useful for telemetry data analysis
Overview
Teaching: 15 min
Exercises: 0 min
Questions
What dplyr functions are useful to analyze my data?
What are some example workflows for analyzing telemetry datasets?
Objectives
Learn to create new metrics from the base data using dplyr.
Create and plot our summary statistics, or save them to new columns and use them in further calculations.
Now that we have an idea of what an exploratory workflow might look like with Tidyverse libraries like dplyr and ggplot2, let’s look at how we might implement a common telemetry tidying workflow using these tools.
# Filtering and data processing
st<-as_tibble(st) # make sure st is a tibble again for speed's sake
# One pipe to fix dates,
# filter subsets,
# calculate lead and lag locations,
# and calculate new derived values, in this case, bearing and distance.
st<-st %>% # overwrite the dataframe with the result of this pipe!
mutate(dt=ymd_hms(DateTime)) %>%
dplyr::select(-1, -2, -DateTime) %>% # Don't return DateTime, or column 1 or 2 of the data
# filter(month(dt)>5 & month(dt)<10) %>% # filter for just a range of months?
arrange(dt) %>%
group_by(tag.ID) %>%
mutate(llon=lag(lon), llat=lag(lat)) %>% # lag longitude, lag latitude (previous position)
filter(lon!=lag(lon)) %>% # If you didn't change positions, drop this row.
rowwise() %>%
filter(!is.na(llon)) %>% # Also drop any NA lag longitudes (i.e. the first detection of each)
mutate(bearing=argosfilter::bearing(llat, lat, llon, lon)) %>% # use mutate and argosfilter to add bearings!
mutate(dist=argosfilter::distance(llat, lat, llon, lon)) # use mutate and argosfilter to add distances!
View(st)
So there’s a lot going on here. First, we’re going to write the result of our pipe right back into the source dataset. This is good for saving memory, but bad for rapid reproducability/tinkering with the workflow. If we get it wrong, we have to go back and re-instantiate our dataset before trying again. So you may not want to jump to doing this right away.
So in our pipe chain here, we are doing a lot of the things we saw earlier. We’re fixing up the date object into a new column dt using lubridate. We’re throwing out the first two columns, as well as the old DateTime string column. We’re potentially filtering on dt, picking a range of months to keep. We’re re-indexing the result with arrange(dt) before we start grouping, to ensure that everything is in temporal order. We group_by() tag.ID, which is a stand-in for individual in this dataset. Then we use mutate() again within our grouped data, along with lag(), to produce new variables llat and llon. The lag() function operates on each group, grabbing the previous location (in time) for each animal and storing it in two new columns. With this position and the previous position, we can calculate a distance and bearing between them. Now, this isn’t a real distance or bearing for the trip between these points; that’s not how acoustic detections work, and we’ll never say ‘the animal traveled exactly X metres along this path between detections’. But there might be a pattern to uncover using these measurements.
st %>%
group_by(tag.ID) %>%
mutate(cdist=cumsum(dist)) %>%
ggplot(aes(dt, cdist, colour=tag.ID))+ geom_step()+
facet_wrap(~Species) +
guides(colour=F)
Now that we have our distance and bearing data, we can do things like calculate the total distance traveled per animal. As we mentioned, this is more of a lower bound than a true measure, but especially in well-gated rivers it could produce a useful metric.
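As a quick sketch of that per-animal total, using the dist column we just created (still only a lower bound on distance traveled):
# total per-step distance for each animal, largest first
st %>%
  group_by(tag.ID) %>%
  summarise(total_dist = sum(dist, na.rm = TRUE)) %>%
  arrange(desc(total_dist))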
st %>%
filter(dist>2) %>%
ggplot(aes(bearing, fill=Species))+
# geom_histogram()+ # could do a histogram
geom_density()+ # and/or a density plot
facet_wrap(~Species) + # one facet per species
coord_polar()
This filter-and-plot by group removes all distances less than 2 (to take away movements within a multi-receiver gate, or multiple detections within the range of two adjacent receivers), and then creates a polar plot showing the dominant bearings for each individual. Because we’re moving longitudinally within a river, we see a dominant east-west trend. And splitting on species shows us there may be differences in how they’re using the river system. To prove a hypothesis like this to ourselves, we’d undergo a lot more filtering and individual-based analysis to see if any detection anomalies or hyper-active individuals are dominating the result.
Key Points
Network analysis using detections of animals at stations
Overview
Teaching: 15 min
Exercises: 0 min
Questions
How do I prepare my data in order to apply network analysis?
Objectives
One of the more popular analysis domains for movement data has been network analysis. Networks are a system of nodes and the edges that connect them. There are a few ways to think about animal movement data in this context, but perhaps an obvious one, which we’ve set ourselves up to use via our previous data exploration, is considering the receiver locations as nodes, and the animals’ travel between them as edges. More trips between receivers, larger and more significant edges. More trips to a receiver from the rest of the network, more centrality for that receiver ‘node’.
# 5.1: Networks are just connections between nodes and we can draw a simple one using animals traveling between receivers.
st %>%
group_by(Species, lon, lat, llon, llat) %>% # we handily have a pair of locations on each row from last example to group by
summarise(n=n()) %>% # count the number of rows in each group
ggplot(aes(x=llon, xend=lon, y=llat, yend=lat))+ # xend and yend define your segments
geom_segment()+geom_curve(colour="purple")+ # and geom_segment() and geom_curve() will connect them
facet_wrap(~Species)+
geom_point() # draw the points as well to make them clear.
So ggplot will show us our travel between nodes using xend and yend, and we can choose a couple of ways to display those connections using geom_segment() and geom_curve(). Splitting on species can show us the different ways each species uses the network, but the scale of the values of edges and nodes is hard to differentiate. Let’s switch our plot scale for the edges to a log scale.
st %>%
group_by(Species, lon, lat, llon, llat) %>%
summarise(n=n()) %>%
ggplot(aes(x=llon, xend=lon, y=llat, yend=lat, size=n %>% log))+
geom_point() + # Put this on the bottom of the plot.
geom_segment()+geom_curve(colour="purple")+
facet_wrap(~Species)
So we pass n to the log function in our argument to aes(), and that gives us a much clearer context for which edges are dominating. Speaking of adding context, let’s bring back bplot as our backdrop for this and see where our receivers are geospatially.
bplot+ # we saved this earlier when doing bathymetry plotting
geom_segment(data=st %>%
group_by(Species, lon, lat, llon, llat) %>%
summarise(n=n()),
aes(x=llon, xend=lon, y=llat, yend=lat, size=n %>% log, alpha=n %>% log), inherit.aes=F)+ # bplot has Z, nothing else does, so inherit.aes=F to ignore missing parent aesthetic values
facet_wrap(~Species) # we also scale alpha because we're going to force a lot of these relationship lines on top of one another with this method.
So, to keep the map visible and to see the effect of lines that overlap heavily (because we’re forcing them into spatial bounds), we set alpha, our transparency setting, to log(n) as well. We also have to leverage inherit.aes=F again, because our network values don’t have a Z axis like our bplot does.
You can take things further into the specifics of network analysis with this data using the igraph and ggraph packages (too big a subject for this tutorial!). When building the source data for it as we’ve done here, you would want to decide whether you’re intending to see trends in individuals, in species, or by other variables; whether the regions are your nodes or individuals are your nodes; whether a few highly detected individuals are providing most of the story in a summary like the one we made here; and so on.
Key Points
Animating ggplot with gganimate and gifski
Overview
Teaching: 10 min
Exercises: 0 min
Questions
How do I animate my tracks?
Objectives
Isolate one individual and animate its path between receivers using gganimate and gifski.
You can extend ggplot with gganimate to generate multiple plots, and stitch them together into an animation. In the glatos package, we’ll use ffmpeg to make videos out of these static images, but you can also generate a gif using gifski.
## Chapter 6: Animating plots ####
# Let's pick one animal to follow
st1<-st %>% filter(tag.ID=="A69-1601-30617") # another great time to check hydration levels
an1<-bgo %>%
fortify %>%
ggplot(aes(x, y, fill=z))+
geom_raster()+
scale_fill_etopo()+
labs(x="Longitude", y="Latitude", fill="Depth")+
theme_classic()+
theme(legend.key.width=unit(5, "cm"), legend.position="top")+
theme(legend.position="top")+
geom_point(data=st %>%
as_tibble() %>%
distinct(lon, lat),
aes(lon, lat), inherit.aes=F, pch=21, fill="red", size=2)+
geom_point(data=st1 %>% filter(tag.ID=="A69-1601-30617"),
aes(lon, lat), inherit.aes=F, colour="purple", size=5)+ # up to this point, an1 is still just a static plot
transition_time(date(st1$dt))+ # transition_time() is what turns it into an animation
labs(title = 'Date: {frame_time}') # Variables supplied to change with animation.
Now that we have the plots in an1, we can animate them by handing them to gganimate::animate().
# an1 is now a list of plot objects but we haven't plotted them.
?gganimate::animate # To go deeper into gganimate's animate function and its features.
gganimate::animate(an1)
Notably: we’re doing a lot of portage! The perils of working in a winding river system, or around land masses is that our straight-line interpolations plain look silly when you animate them this way.
Later we’ll use the glatos
package to help us dodge land masses better in our transitions.
Key Points
Introduction to GLATOS Data Processing
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How do I load my data into GLATOS?
How do I filter out false detections?
How can I consolidate my detections into detection events?
How do I summarize my data?
Objectives
GLATOS is a powerful toolkit that provides a wide range of functionality for loading, processing, and visualizing your data. With it, you can gain valuable insights with quick and easy commands that condense high volumes of base R into straightforward functions, with enough versatility to meet a variety of needs.
First, we must set our working directory and import the relevant library.
## Set your working directory ####
setwd("./code/glatos/")
library(glatos)
Your code may not be in the ‘code/glatos’ folder, so use the appropriate file path for your data.
Next, we will create paths to our detections and receiver files. GLATOS can function with both GLATOS- and OTN-formatted data, but the functions are different for each. Both, however, provide a marked performance boost over base R, and both ensure that the resulting data set will be compatible with the rest of the glatos framework.
We will use the NSBS Blue Shark detections that come bundled with GLATOS. Please read the system.file documentation to understand how to load your own data from its location. You may not need to use the system.file command to build the path if the data is stored in the same place you’re running your script from.
# Get file path to example blue shark OTN data
det_file_name <- system.file("extdata", "blue_shark_detections.csv",
package = "glatos")
# Receiver Location
rec_file_name <- system.file("extdata", "hfx_deployments.csv",
package = "glatos")
Remember: you can always check a function’s documentation by typing a question mark, followed by the name of the function.
## GLATOS help files are helpful!! ####
?read_otn_detections
With our file path in hand, we’ll want to use the read_otn_detections function to load our data into a dataframe. In this case, our data is formatted in the OTN style; if it were GLATOS-formatted, we would want to use read_glatos_detections() instead.
# Save our detections file data into a dataframe called detections
detections <- read_otn_detections(det_file=det_file_name)
Remember that we can use head() to inspect a few lines of our data to ensure it was loaded properly.
# View first 2 rows of output
head(detections, 2)
We can do the same for our receivers with the read_otn_deployments function.
?read_otn_deployments
# Save receiver information into receivers dataframe
receivers <- read_otn_deployments(rec_file_name)
head(receivers, 2)
With our data loaded, we next want to apply a false-detection filtering algorithm to reduce the number of false detections in our dataset. GLATOS uses the Pincock algorithm to flag probable false detections based on the time lag between detections: tightly clustered detections are weighted as more likely to be true, while detections spaced out temporally will be marked as false.
## Filtering False Detections ####
## ?glatos::false_detections
# write the filtered data (no rows deleted, just a filter column added)
# to a new det_filtered object
detections_filtered <- false_detections(detections, tf=3600, show_plot=TRUE)
head(detections_filtered)
nrow(detections_filtered)
The false_detections function will add a new column to your dataframe, ‘passed_filter’. This contains a boolean value that will tell you whether or not that record passed the false detection filter. That information may be useful on its own merits; for now, we will just use it to filter out the false detections.
# Filter based on the column if you're happy with it.
detections_filtered <- detections_filtered[detections_filtered$passed_filter == 1,]
nrow(detections_filtered) # Smaller than before
With our data properly filtered, we can begin investigating it and developing some insights. GLATOS provides a range of tools for summarizing our data so that we can better see what our receivers are telling us.
We can begin with a summary by animal, which will group our data by the unique animals we’ve detected.
# Summarize Detections ####
# By animal ====
sum_animal <- summarize_detections(detections_filtered, summ_type='animal')
sum_animal
We can also summarize by location, grouping our data by distinct locations.
# By location ====
sum_location <- summarize_detections(detections_filtered, location_col='glatos_array', summ_type='location')
head(sum_location)
Finally, we can summarize by both dimensions.
# By both dimensions
sum_animal_location <- summarize_detections(det = detections_filtered,
location_col = 'glatos_array',
summ_type='both')
head(sum_animal_location)
One other method: we can summarize by a subset of our animals as well. If we only want to see summary data for a fixed set of animals, we can pass a vector containing the animal_ids that we want to see summarized.
# create a custom vector of Animal IDs to pass to the summary function
# look only for these ids when doing your summary
tagged_fish <- c("NSBS-Alison", "NSBS-Brandy", "NSBS-Hey Jude")
sum_animal_custom <- summarize_detections(det=detections_filtered,
animals=tagged_fish,
summ_type='animal')
sum_animal_custom
Alright, we can summarize our data. Let’s move on and see if we can make our dataset more amenable to plotting by reducing it from detections to detection events.
Detection Events differ from detections in that they condense a lot of temporally and spatially clustered detections for a single animal into a single detection event. This is a powerful and useful way to clean up the data, and makes it easier to present and clearer to read. Fortunately, GLATOS lets us do this easily.
# Reduce Detections to Detection Events ####
# ?glatos::detection_events
# arrival and departure time instead of multiple detection rows
# you specify how long an animal must be absent before starting a fresh event
events <- detection_events(detections_filtered,
location_col = 'station', # group detections by receiver station when defining events
time_sep=432000)
head(events)
We can also keep the full extent of our detections, but add a group column so that we can see how they would have been condensed.
# keep detections, but add a 'group' column for each event group
detections_w_events <- detection_events(detections_filtered,
location_col = 'station', # group detections by receiver station when defining events
time_sep=432000, condense=FALSE)
With our filtered data in hand, let’s move on to some visualization.
Key Points
Basic Visualization and Plotting
Overview
Teaching: 30 min
Exercises: 0 min
Questions
Objectives
We can use GLATOS to quickly and effectively visualize our data, now that we’ve cleaned it up.
One of the simplest ways is to use an abacus plot to display animal detections against the appropriate stations.
# Visualizing Data - Abacus Plots ####
# ?glatos::abacus_plot
# customizable version of the standard VUE-derived abacus plots
abacus_plot(detections_w_events,
location_col='station',
main='NSBS Detections By Station') # can use plot() variables here, they get passed thru to plot()
This is good, but cluttered. We can also filter out a single animal ID and plot only the abacus plot for that.
# pick a single fish to plot
abacus_plot(detections_filtered[detections_filtered$animal_id== "NSBS-Alison",],
location_col='station',
main="NSBS-Alison Detections By Station")
Additionally, if we only wanted to plot for a subset of receivers, we could do that as well, using the following code to isolate a particular subset. This is less useful for our Blue Shark dataset, and more applicable to a dataset structured like GLATOS data, where the glatos_array column would link multiple receivers together into an array, which we could then select on. In this case, we’re only using our stations, so it’s less important.
# subset of receivers?
receivers
receivers_subset <- receivers[receivers$station %in% c('HFX005', 'HFX031', 'HFX040'),]
receivers_subset
det_first <- min(detections$detection_timestamp_utc)
det_last <- max(detections$detection_timestamp_utc)
receivers_subset <- receivers_subset[
receivers_subset$deploy_date_time < det_last &
receivers_subset$recover_date_time > det_first &
!is.na(receivers_subset$recover_date_time),] #removes deployments without recoveries
locs <- unique(receivers_subset$station)
locs
In keeping with earlier lessons, we could also do the same selection using dplyr’s filter() method.
receivers
receivers_subset <- receivers[receivers$station %in% c('HFX005', 'HFX031', 'HFX040'),]
receivers_subset
det_first <- min(detections$detection_timestamp_utc)
det_last <- max(detections$detection_timestamp_utc)
receivers_subset <- filter(receivers_subset, deploy_date_time < det_last &
recover_date_time > det_first &
!is.na(receivers_subset$recover_date_time))
receivers_subset
Note that the row indices are different: filter() automatically resets them to 1-3, whereas selecting them with base R preserves the original numbering. Use whichever best suits your needs.
We can also plot abacus plots with receiver histories, which can give us a better idea of the order in which our tracked fish travelled through our receivers. We just have to pass a few extra parameters to the abacus plot function, including our receivers dataframe, so that it builds out the history.
Note that we must add a glatos_array column to receivers before we can plot it in this way: the function still expects GLATOS data. For our purposes, it is enough to use the station column, but different variables may suit your data differently.
# Abacus Plots w/ Receiver History ####
# Using the receiver data frame from the start:
# See the receiver history behind the detections to know what you could see.
receivers$glatos_array = receivers$station
abacus_plot(detections_filtered[detections_filtered$animal_id == 'NSBS-Hooker',],
pch = 16,
type='b',
receiver_history=receivers,
location_col = 'station')
If we want to see actual physical distribution, a bubble plot will serve us better.
Before we can plot this data properly, we need to download a shapefile of Nova Scotia. This will give us a map on which we can plot our data. We can get a suitable shapefile for Nova Scotia from GADM, the database of global administrative areas. The following code will retrieve first the country, then the province:
Canada <- getData('GADM', country="CAN", level=1)
NS <- Canada[Canada$NAME_1=="Nova Scotia",]
With the map generated, we can pass it to the bubble plot and see the results.
# Bubble Plots for Spatial Distribution of Fish ####
# bubble variable gets the summary data that was created to make the plot
detections_filtered
bubble <- detection_bubble_plot(detections_filtered,
location_col = 'station',
map = NS,
col_grad=c('white', 'green'),
background_xlim = c(-66, -62),
background_ylim = c(42, 46))
There are additional customisations we can perform, which let us tune the output of the bubble plot function to better suit our needs. A few of the parameters are demonstrated below, but we encourage you to investigate the documentation and see what suits your needs.
# more complex example including zeroes by adding a specific
# receiver locations dataset, in this case the receivers dataset above.
bubble_custom <- detection_bubble_plot(detections_filtered,
location_col='station',
map = NS,
background_xlim = c(-63.75, -63.25),
background_ylim = c(44.25, 44.5),
symbol_radius = 0.7,
receiver_locs = receivers,
col_grad=c('white', 'green'))
Key Points
Basic Animations
Overview
Teaching: 30 min
Exercises: 0 min
Questions
Objectives
With just a little extra code, we can go from static visualizations right into animations with the functionality provided by the glatos package.
First, we need to import a couple more libraries that will let us manipulate polygons and raster objects. With those, we’ll be able to build a map for our plot.
library(sp) #for loading greatLakesPoly
library(raster) #for raster manipulation (e.g., crop)
We’ll also re-use our shapefile of the NS coastline from earlier.
Canada <- getData('GADM', country="CAN", level=1)
NS <- Canada[Canada$NAME_1=="Nova Scotia",]
For the purposes of this workshop, we’ll only track one fish. You can select a single fish’s detections in a certain timeframe with the code below, changing the dates and animal IDs as befits your data.
# extract one fish and subset date ####
det <- detections[detections$animal_id == 'NSBS-Hooker' &
detections$detection_timestamp_utc > as.POSIXct("2014-01-01") &
detections$detection_timestamp_utc < as.POSIXct("2014-12-31"),]
The polygon we’ve loaded is quite large, so we want to crop it down to a more reasonable area. Otherwise, we run the risk that our code will time out, exhaust RStudio’s memory allocation, or just take too long to be practical. We can use the crop() function, and pass it our polygon, and a lon/lat extent to crop to- in this case, we’ll get an area around Halifax Harbour, where our detections were picked up.
# crop polygon to just outside Halifax Harbour ####
halifax <- crop(NS, extent(-66, -62, 42, 45))
Now that we have a polygon representing the appropriate area, we can plot it. We also want to put the points representing our fish detections on the plot, which we can do with the points() function.
plot(halifax, col = "grey")
points(deploy_lat ~ deploy_long, data = det, pch = 20, col = "red",
xlim = c(-66, -62))
That might look OK, but if we’re going to animate it, we need more: since we need to interpolate the paths between points to properly make the animation, we need to know whether our interpolations make any sense. To do that, we need to know whether our paths cross land at any point. And to do that, we need to know what on our map is water, and what on our map is land. Fortunately, glatos lets us do this with the make_transition() function.
make_transition interpolates the paths between points using a least-cost path. In our case, the only two costs are 0, if the path only goes through water, or infinity, if the path crosses land. This is hypothetically extensible, however, to a broader set of behaviours, if multiple rasters were used and costs were differently weighted.
make_transition generates two objects: a transition layer and a raster object, both representing the land and water on the map we’re plotting on. However, take note! Since glatos was originally designed to run on data from lakes, it will treat any fully-bounded polygon as a lake. This means that depending on your shapefile, glatos may mistake your land for water, and vice-versa!
Fortunately, we can get around this. make_transition takes a parameter, ‘invert’, that defaults to FALSE. If it is TRUE, then make_transition() will return the inverse of the raster it otherwise would have returned. If it is treating your land masses as water, this should fix the problem.
tran <- make_transition_flipped(halifax, res = c(0.1, 0.1))
Finally, we have our transition layer and our raster. Let’s plot it against our polygon and see how it looks.
plot(tran$rast, xlim = c(-66, -62), ylim = c(42, 45))
plot(halifax, add = TRUE)
That doesn’t look great- the resolution isn’t very high, and the water area (green by default) definitely doesn’t match up with the coastline. We can re-run make_transition with a higher resolution, however, to provide a better match.
# not high enough resolution- bump up resolution
tran1 <- make_transition_flipped(halifax, res = c(0.001, 0.001))
Keep in mind that increasing the resolution will make the function take a longer time to run, and on large polygons, the operation can time out. It’s up to you to find the right balance of speed and detail as appropriate for your dataset.
# plot to check resolution- much better
plot(tran1$rast, xlim = c(-66, -62), ylim = c(42, 45))
plot(halifax, add = TRUE)
Much better!
The next step is to add our points to the map. For this, we’ll break out the unique lat/lon pairs from our data and only plot those. That will cut down on the size of the data set we’re adding and make our rendering speedier.
# add fish detections to make sure they are "on the map"
# plot unique values only for simplicity
foo <- unique(det[, c("deploy_lat", "deploy_long")])
points(foo$deploy_long, foo$deploy_lat, pch = 20, col = "red")
We’re finally ready to interpolate our points, which we can do with the interpolate_path() function. We pass it our detections, our transition layer, and a specification for how it should output its results- in this case, as a data.table. This may not be right for your dataset, so be sure to check.
# call with "transition matrix" (non-linear interpolation), other options ####
# note that it is quite a bit slower than linear interpolation
pos2 <- interpolate_path(det, trans = tran1$transition, out_class = "data.table")
At last, we can do a quick sanity check, plotting our interpolated data against our map and making sure all our points are in the right place- that is, the water.
plot(halifax, col = "grey")
points(latitude ~ longitude, data = pos2, pch=20, col='red', cex=0.5)
We’re finally ready to generate our animation. Using the make_frames function, we can generate, at once, a series of still frames representing the animation, and the animation itself.
Note that we have to pass some extra parameters to the make_frames() function. Since glatos was written for the Great Lakes, by default it will use those as its background map and lat/lon extent. We can change the former with bg_map, which we set to the ‘halifax’ object we’ve been using, and use background_xlim and background_ylim to set the lon/lat properly; otherwise, even if we have the right map, our dots won’t display.
We also have ‘overwrite’ set to TRUE, which just means that if the files already exist, they’ll be overwritten. That’s fine here, but you may want to set yours differently.
# Make frames out of the data points ####
# ?make_frames
# just gimme the animation!
#setwd('~/code/glatos')
frameset <- make_frames(pos2, bg_map=halifax, background_ylim = c(42, 45), background_xlim = c(-66, -62), overwrite=TRUE)
Excellent! A brief but accurate animation of our data points.
You may find glatos’ animation options restrictive, or might not like the output they generate. If that’s the case, you can set the ‘animate’ parameter to FALSE and just get the frames, which you can then take into your own animation software to tweak as you see fit. You can also supply an out_dir to specify where the animations and frames get written to, if you want it to be somewhere other than your working directory.
# can set animate = FALSE to just get the composite frames
# to take into your own program for animation
#
pos1 <- interpolate_path(det)
frameset <- make_frames(pos1, bg_map=halifax, background_ylim = c(42, 45), background_xlim = c(-66, -62), out_dir=paste0(getwd(),'/anim_out'), overwrite=TRUE)
Key Points
Basic Tag and Receiver Simulations using GLATOS
Overview
Teaching: 30 min
Exercises: 0 min
Questions
Objectives
A lesson reviewing the simulation capabilities of glatos for evaluating tag programming and multiple tag/receiver interactions is in development and will be coming soon!
In the interim, you may want to review the relevant functions, and the examples in the glatos package vignettes here:
https://gitlab.oceantrack.org/GreatLakes/glatos/tree/master/vignettes
or Joseph Stachelek’s online documentation of the GLATOS package here: https://rdrr.io/github/jsta/glatos/man/calc_collision_prob.html
Key Points
Fitting Hidden Markov Models with moveHMM
Overview
Teaching: 30 min
Exercises: 0 min
Questions
What can applying hidden Markov models to my movement data using moveHMM tell me about the behaviour of my animals?
Objectives
Fit a hidden Markov model.
Pick an appropriate time step.
Pick starting values.
Interpret model results.
Assess model fit.
Lesson 1 - fit an HMM
############################
### lesson 1: fit an HMM ###
# picking time step
# picking starting values
# interpretation
# assessing model fit
############################
# let's bring in the data for the first HMMs
load("pikedat.rda")
ls()
# so we have data from the hydrophones, data for the lake, and data for the two models fit to different fish
# these are yaps output, so we need to keep in mind that we're going to be fitting more models to model output
# here's an initial view of the data (thanks Henrik for the code!)
plot(Y~X, data=lake, col="black", asp=1, type="l") # lake
points(y~x, data=hydros, pch=20, cex=2, col="navy") # receivers
lines(y~x, data=yaps1315, col="cadetblue")
lines(y~x, data=yaps1335, col="tomato")
Let’s take a look at all of the variables in our dataset.
head(yaps1315)
Here x is the Easting and y is the Northing; that means we’re projected, in UTM 32N (EPSG:32632).
- sd_x and sd_y are the standard deviations associated with the YAPS model
- top is the time of the ping
- nobs is the number of receivers that detected each ping
- velocity is the instantaneous speed of sound
Let’s first start just by plotting the data. Since we’re looking at predicted values from a model, I would also like to look at a confidence interval around the predictions (normally distributed. Let’s just use 95% CI)
plot(yaps1335$x, yaps1335$y, type="l")
points(yaps1335$x-1.96*yaps1335$sd_x, yaps1335$y-1.96*yaps1335$sd_y,
type='l', col='skyblue')
points(yaps1335$x+1.96*yaps1335$sd_x, yaps1335$y+1.96*yaps1335$sd_y,
type='l', col='skyblue')
plot(yaps1315$x, yaps1315$y, type="l")
points(yaps1315$x-1.96*yaps1315$sd_x, yaps1315$y-1.96*yaps1315$sd_y,
type='l', col='skyblue')
points(yaps1315$x+1.96*yaps1315$sd_x, yaps1315$y+1.96*yaps1315$sd_y,
type='l', col='skyblue')
Neither of these confidence intervals looks very wide, which is a good sign.
I also like to take a look at each location axis against time, which tells me a bit about whether the locations are evenly spaced in time, whether there are any gaps, and a bit about the behaviours.
plot(x ~top, data=yaps1335, type="p")
plot(y ~top, data=yaps1335, type="p")
In these first two plots, we see no gaps, and nothing really looks out of the ordinary, although there is an almost cyclical change, and the spread gets wider as time goes on.
plot(x ~top, data=yaps1315, type="p")
plot(y ~top, data=yaps1315, type="p")
These next two plots show some interesting behaviour: it looks like the animal might be showing some clear residency behaviour, where it’s not changing its position at all, and then some clear directed movement.
Pike are burst swimmers, so we know of at least three behaviours: a sit-and-wait, a burst, and likely a movement between the sit-and-wait spots. We might not be able to pick up the burst behaviour, though, because it’s really quick.
Movement and Behavioural States
The first thing that we need to do is interpolate the data onto a regular time interval. In order to do this, we need to pick a time step. There are a couple of different schools of thought about this; one is to pick a time step that will give you a number of interpolated locations close to the number of observations that you have.
Let’s try that first, so we need to take a look at the distribution of the temporal intervals. When working with time, note that just the regular diff
function will pick the units for you. If you want to specify the units then you need to use difftime
and supply two vectors.
diff(yaps1315$top) %>% as.numeric() %>% density() %>% plot()
# check out a density plot of the values
difftime(yaps1315$top[-1], yaps1315$top[-nrow(yaps1315)], units="secs") %>% as.numeric() %>% density() %>% plot()
# uniformly distributed from 60 to 120, makes sense (random burst interval tag)
difftime(yaps1315$top[-1], yaps1315$top[-nrow(yaps1315)], units="secs") %>% mean()
# a value of 90s would probably be good to start!
# now let's interpolate the data
tempinterp <- function(anim, ts){
# ts is time step in seconds
# anim is animal with datetime in posix, then longitude, then latitude
# colnames are "date", "x", and "y"
t0 = anim$date[1] # first time step
tT = anim$date[nrow(anim)] # last time step
t = seq(t0, tT, by=ts) # sequence of time steps
# interpolate between the two
loc <- data.frame(date=t,
x = approx(anim$date, anim$x, xout = t)$y,
y = approx(anim$date, anim$y, xout = t)$y)
}
# interpolate the first pike's data
yaps1315 %>% rename(date=top) %>% tempinterp(ts=90) -> pike
head(pike)
# note that the times are all 90 seconds apart
# check that all the points lie on the path
yaps1315 %>% ggplot(aes(x=x, y=y)) + geom_path() + geom_point(aes(x, y), data=pike, col='cadetblue')
Now that we have locations occurring in regular time, the next step is to decompose the track into step lengths and turning angles. There are a few ways to do this, and different packages do it in different ways; it’s really important to understand your projection (or lack thereof). With moveHMM, we need to use the prepData function.
pike_prep <- prepData(pike, type="UTM")
# type can be either UTM or LL - use UTM for easting/northing (because we are projected)
# use coordNames argument if your locations are named other than the default (x, y)
head(pike_prep)
tail(pike_prep)
Here we see a couple of things:
- there is an animal ID that is automatically created (we’ll get back to this later)
- step is the step length; these are in m because our original locations are in m
- angle is the turning angle; these are in radians (multiply by 180/pi if you want degrees). Note a couple of NAs, because it takes two locations to calculate a step length and three to calculate a turning angle. Also note that the alignment of these NAs suggests that the step length from times t-1 to t and the angle between times t-1, t, and t+1 will be informing the same state.
Now let’s take a look at the distributions. moveHMM
has a generic plotting function for the processed data
plot(pike_prep)
# i also like to look at the densities and histograms side by side
par(mfrow=c(2,2))
density(pike_prep$step, na.rm=TRUE) %>% plot(main="Step Length Density")
hist(pike_prep$step, main = "Step Length Histogram")
density(pike_prep$angle, na.rm=TRUE) %>% plot(main="Turning Angle Density")
hist(pike_prep$angle, main = "Turning Angle Histogram")
Our objective is to look for multiple states. In modelling terms, this means that we are assuming that these observations shouldn’t be modelled with just one underlying probability distribution, but multiple i.e., these empirical densities actually contain multiple parametric densities within them.
In an HMM, we want to estimate these multiple densities, and the states that relate to them.
An optimization routine is an algorithm that seeks to find the optimum of a function. moveHMM uses nlm, and markmodmover uses nlminb. In order to optimize a function, you need to have starting values. If your function is bumpy, then it can be harder to find a global optimum, so you want to be careful about your choice of starting values, and it’s a good idea to check multiple sets.
To pick starting values, I like to overlay densities on my histograms. For moveHMM, the default densities are gamma (step length) and von Mises (turning angle).
?dgamma # shape and rate parameter
# rate is inverse of scale, so as rate goes up, the spread goes down
# shape is eponymous: it literally determines the shape of the distribution
# get a sense of the distribution with the following plots
par(mfrow=c(1,1))
curve(dgamma(x,1,10), col="black", lwd=2)
curve(dgamma(x,2,10), add=TRUE, col="royalblue", lwd=2)
curve(dgamma(x,5,10), add=TRUE, col="cadetblue", lwd=2)
curve(dgamma(x,10,10), add=TRUE, col="mediumseagreen", lwd=2)
curve(dgamma(x,2,10), col="black", lwd=2)
curve(dgamma(x,2,5), add=TRUE, col="royalblue", lwd=2)
curve(dgamma(x,2,2), add=TRUE, col="cadetblue", lwd=2)
curve(dgamma(x,2,1), add=TRUE, col="mediumseagreen", lwd=2)
?circular::dvonmises #mean mu and concentration kappa
# mu determines where the angles are centered
# concentration determines how concentrated the distribution is around the mean
# let's play
curve(circular::dvonmises(x,0,1), from=-pi, to=pi, ylim=c(0,1), col="black", lwd=2)
curve(circular::dvonmises(x,0,5), add=TRUE, col="royalblue", lwd=2)
curve(circular::dvonmises(x,pi,1), add=TRUE, col="cadetblue", lwd=2)
curve(circular::dvonmises(x,pi,5), add=TRUE, col="mediumseagreen", lwd=2)
# pretty clear that the mean determines the location, and the concentration
# determines the spread
# now we need to pick a set of starting values
# there is a great guide from Théo Michelot and Roland Langrock:
# https://cran.r-project.org/web/packages/moveHMM/vignettes/moveHMM-starting-values.pdf
# i like to overlay a few different sets on histograms of my data
# for the step lengths, we're normally looking for a distribution that is more
# concentrated around the smaller step lengths, and another that is more
# spread which encapsulates the longer (but less dense) step lengths
hist(pike_prep$step, breaks=20, freq=FALSE, main = "pike step lengths")
curve(dgamma(x,0.8,rate=1), n=200, add=TRUE, col="royalblue", lwd=4)
curve(dgamma(x,1.5,rate=0.5), n=200, add=TRUE, col="cadetblue", lwd=4)
curve(dgamma(x,1.2,rate=0.01), n=200, add=TRUE, col="navyblue", lwd=4)
# maybe a third state to capture the outliers?
# for the turning angles
# here we have a distribution that is fairly clear in terms of the
# two states - typically, we like to look for a state with a lot of observations
# around 0, which is indicative of directed movement, and then another flatter
# distribution to encapsulate all of the other directions, which suggests more
# random or undirected movement. Sometimes, as in this case, we might actually
# see a bump at the edges (around pi and -pi) suggesting that there is a state
# with course reversals
hist(pike_prep$angle, freq=FALSE, main = "pike turning angles")
curve(circular::dvonmises(x,pi,0.3), n=200, add=TRUE, col="royalblue", lwd=4)
curve(circular::dvonmises(x,0,2.5), n=200, add=TRUE, col="cadetblue", lwd=4)
# now that we have chosen some starting values, we can fit our first HMM!
# note that moveHMM parameterizes the gamma in terms of the mean and standard
# deviation, so we need to change our parameters over
# mean = shape/rate
# sd = shape^(1/2)/rate
# can find these formulas on wikipedia page
sl_init_mean <- c(1.5/0.5, 0.8/1)
sl_init_sd <- c(sqrt(1.5)/0.5, sqrt(0.8)/1)
ta_init_mean <- c(0, pi)
ta_init_con <- c(2.5, 0.3)
# note that the order matters! The first parameters for each observation
# apply to the first state, and so on...
mod <- fitHMM(data = pike_prep,
nbStates = 2,
stepPar0 = c(sl_init_mean, sl_init_sd),
anglePar0 = c(ta_init_mean, ta_init_con),
formula = ~1,
stepDist = "gamma",
angleDist = "vm")
# and it fitted! let's check out the results
# model summary
mod
# state 1: directed movement with long step lengths
# state 2: undirected movement with shorter step lengths
# transition probability coefficients are in terms of the probabilities of switching to
# a different state
# intial distribution is of the behavioural states
# model plots
plot(mod)
# model confidence intervals
CI(mod)
# model states
states <- viterbi(mod)
states
data.frame(val = rle(states)$values, n = rle(states)$lengths) %>%
ggplot(aes(val %>% factor, n)) + geom_violin()
# look at the distribution of runs of each state
# runs in state 2 generally look to be a bit longer
# model validation
mod %>% plotPR()
# note that there is a fair amount of autocorrelation, and a bit of
# deviation from the QQ line. There are a couple of things that we can try.
# 1) coarser time scale (but might shift our interpretation of the states)
# 2) different starting values (in case we're not on the optimum)
# 3) accounting for autocorrelation using markmodmover
# try a coarser time scale
# 4.5 for 4.5 mins
yaps1315 %>% rename(date=top) %>% tempinterp(ts=90*3) -> pike4.5
# plot the two datasets
p1 <- yaps1315 %>% ggplot(aes(x=x, y=y)) +
geom_path() +
geom_point(aes(x, y), data=pike, col="cadetblue")
p2 <- yaps1315 %>% ggplot(aes(x=x, y=y)) +
geom_path() +
geom_point(aes(x, y), data=pike4.5, col="tomato")
grid.arrange(p1, p2, ncol = 1)
# prep the new data and fit the model
pike_prep4.5 <- prepData(pike4.5, type="UTM")
mod4.5 <- fitHMM(data = pike_prep4.5,
nbStates = 2,
stepPar0 = c(sl_init_mean, sl_init_sd),
anglePar0 = c(ta_init_mean, ta_init_con),
formula = ~1,
stepDist = "gamma",
angleDist = "vm")
mod4.5 %>% plotPR()
# reduced the autocorrelation a bit at the later lags!
# but we still have some at the earlier lags
# suggests that acf is partly because of the interpolation
# try different starting values
# do this for the coarser time scale
sl_init_mean <- sl_init_sd <- runif(2, min(pike_prep4.5$step, na.rm=TRUE), max(pike_prep4.5$step, na.rm=TRUE))
# in the guide it suggests picking an sd of a similar value to the mean
ta_init_mean <- runif(2, -pi, pi)
ta_init_con <- runif(2, 0, 10)
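# aside (my addition): these starting values are random draws, so they will differ between runs;
# calling set.seed() before the runif() lines above makes the fit reproducible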
# note that the order matters! The first parameters for each observation
# apply to the first state, and so on...
mod_inits2 <- fitHMM(data = pike_prep4.5,
nbStates = 2,
stepPar0 = c(sl_init_mean, sl_init_sd),
anglePar0 = c(ta_init_mean, ta_init_con),
formula = ~1,
stepDist = "gamma",
angleDist = "vm")
mod_inits2 %>% plotPR() # looks the same
# check how close they are
mod4.5$mle$stepPar
mod_inits2$mle$stepPar
# so this didn't help us (in terms of residuals),
# parameter values are the same which is a good sign that we are at the global optimum
# of the likelihood
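# as an extra check (my addition), the two fits should also give essentially the same AIC
# (moveHMM supplies an AIC method for fitted models)
AIC(mod4.5)
AIC(mod_inits2)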
Key Points
Accounting For ACF
Overview
Teaching: 30 min
Exercises: 0 minQuestions
Objectives
Coming soon!
####################################
### lesson 2: accounting for acf ###
####################################
# to try to better account for acf we will use markmodmover
# markmodmover models the mean of the step length as a function of the previous
# step length, thus accounting for the acf
# do note that some of this acf can be caused by the linear interpolation of the data
# markmodmover uses S4 classes, which are a bit more formal than the S3 classes that
# are very frequently used, and that are used by moveHMM. This is just a different way
# of calling functions in R. Here's a link to a really good explanation of the different
# object oriented classes available in R
# http://adv-r.had.co.nz/OO-essentials.html
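# a tiny generic illustration of the difference (my addition, not markmodmover-specific):
x_s3 <- data.frame(a = 1:3)                # S3: components are accessed with $
x_s3$a
setClass("Toy", slots = c(a = "numeric"))  # S4: components live in named "slots"
x_s4 <- new("Toy", a = c(1, 2, 3))
x_s4@a                                     # accessed with @ (or slot(x_s4, "a"))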
# so we can check out the vignette for markmodmover:
vignette("markmodmover")
# this gives us a nice schematic of the package workflow. just like in movehmm,
# you start with your data, you put it in the correct format using a package function,
# then you fit the model. you can also tweak some of the model fitting options
# using setmodel4m, and you can simulate from your fitted model (to assess accuracy)
# let's start by putting our data in the correct format. This package was designed
# for irregularly spaced (in time) data, so we can just feed data4M the original data frame
# and then use a package interpolation function
# however, this package only takes data in lats and lons. our data are projected,
# so we need to "unproject" them!
# awesome overview of projections here: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/OverviewCoordinateReferenceSystems.pdf
# i'm using rgdal, lots of people are using sf now which is supposed to be a bit more readable
# below is example code of reading in a shapefile, but i've added it to the rda file anyways
# world <- readOGR(dsn="~/Documents/Shapefiles", layer="World") #natural earth
yaps1315 %>%
select(x, y) %>%
as.matrix() %>%
SpatialPoints(proj4string=CRS("+init=epsg:32632")) -> pike_proj
pike_proj %>% spTransform(CRS("+proj=longlat")) -> pike_unproj
plot(world, xlim=c(0, 11), ylim=c(50, 60), col="cadetblue")
points(pike_unproj, type='l', lwd=5, col="tomato") #there we are in Denmark!
# (i typically plot it to make sure it worked)
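# aside (my addition): the same "unprojection" with sf would look roughly like this;
# a sketch only, assuming sf is installed -- it isn't used further below
# library(sf)
# pike_unproj_sf <- yaps1315 %>%
#   st_as_sf(coords = c("x", "y"), crs = 32632) %>%
#   st_transform(4326)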
# so now we use the data4M() function to prep the data
# it has a bit of flexibility in the naming of the data, i use the shortest possible (laziness)
# so i renamed to lon and lat
pike_unproj@coords %>%
as.data.frame() %>%
mutate(date = yaps1315$top) %>%
dplyr::select(date, x, y) %>%
rename(lon=x, lat=y) %>%
data4M() -> pike
pike
# so there are three sections, one describing the observed locations,
# one describing the time differences, and one describing the interpolated locations
# the interpolated locations are empty, we need to pick a time step!
# the time differences are calculated in hours
90/3600 # 90 s expressed in hours; this is roughly the mean time difference, so let's stick with a 90 s time step
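# cross-check of the raw sampling interval (my addition; assumes yaps1315$top is POSIXct):
yaps1315$top %>% diff() %>% as.numeric(units = "secs") %>% mean(na.rm = TRUE)
# should come out near 90 s, matching the mean time difference reported above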
# markmodmover has its own interpolation function
pike <- interpolate(pike, Time.Step=90/3600) # takes the time step in hours
pike
# now under interpolated locations we have some summary info
# in markmodmover, missing data (gaps in time) are dealt with by splitting the track
# into groups and then fitting the same HMM to each group (setting parameter estimates
# to be equal across tracks). A group cutoff (the interval used to determine when to
# split the track up) is automatically chosen, see Lawler et al. (2019) for details.
# you can adjust this using the Group.Cutoff argument
# here we have only 1 group, so we didn't need to split the track up at all.
# let's plot the new data
str(pike) # gives a sense of the structure of s4
# have to use the @ symbol to access slots
# or some of these have functions to access the slots
# we could have used observedLocations(pike) instead
plot(Lat~Lon, data=pike@Observed.Locations, type="l")
points(Lat~Lon, data=pike@Interpolated.Locations, pch=20, col='royalblue')
# everything is lying on the path again
# you can also use plot()
plot(pike)
# The last two plots are interesting, but hard to see right now because there are a few
# outlying step lengths
# we can clone the data and take them out to take a closer look
# (note that this will change the plot a bit though)
# this gives us a list of the outliers, we'll take out step lengths >2 based on this and the axes
pike@Movement.Data$Movement.Data$Step.Length %>% boxplot.stats()
pike4viz <- pike
pike4viz@Movement.Data$Movement.Data <- pike4viz@Movement.Data$Movement.Data %>% filter(Step.Length < 2)
plot(pike4viz, y="data")
# These are both 2 dimensional kernel density estimates (using MASS::kde2d)
# the first one is basically a kde of each of the step vectors centered at 0
# the second is a kde of the step length at time t vs at time t-1, which
# can be used to help figure out what kind of model to fit - if it looks
# like a continuous line along y=x, then use the carHMM. if it looks like
# a series of droplets along y=x, then use a regular HMM. See Lawler et al. 2019.
# to me this looks like it's suggesting we should use the carHMM, which corroborates the earlier evidence
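# the same lag-1 step-length picture can be built by hand with MASS::kde2d
# (a rough sketch of my own; reuses the slot path from above and drops the outliers)
steps <- pike@Movement.Data$Movement.Data$Step.Length
steps <- steps[!is.na(steps) & steps < 2]
lag1 <- MASS::kde2d(head(steps, -1), tail(steps, -1), n = 50)
image(lag1, xlab = "step length at time t-1", ylab = "step length at time t")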
# now that the data are in the correct format, we can fit the model!
# with this model, starting values are picked randomly
# it also uses a wrapped cauchy for a turning angle distribution
# but you can pick between a gamma and log normal for the step lengths (gamma is default)
# here i stick with default
mod_ac <- fit(pike)
mod_ac
# note a couple of things
# wrapped cauchy distribution here is parameterized with a scale parameter between 0 and 1
# higher value -> more concentrated
# mean of the gamma distribution is mean = (1-ac)*reversion.level + ac*previous.step.length
# so it changes with every step, it's not just dependent on the state anymore
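# a made-up numeric illustration of that mean (these values are not from the fitted model):
ac <- 0.6; reversion.level <- 1.2; previous.step.length <- 2.0
(1 - ac) * reversion.level + ac * previous.step.length  # expected length of the next step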
# state definitions have switched, compare with previous mod again
mod
# in previous models, the longer step lengths were associated with state 1, now associated with state 2
# tpms are pretty similar though!
states_ac <- viterbiPath(mod_ac)
data.frame(val = rle(states_ac)$values, n = rle(states_ac)$lengths) %>%
ggplot(aes(val %>% factor, n)) + geom_violin()
# looks pretty similar, but note some noticeably longer step lengths in the residency state
# now we take a look at the model diagnostics using plot() again
plot(mod_ac)
# we can also compare the pseudoresiduals from the two models
# note both models used interpolated locations on a 90s scale
par(mfrow=c(2,1))
pseudoRes(mod)$stepRes %>% acf(main="moveHMM", na.action=na.pass)
mod_ac@Residuals$Step.Length %>% acf(main="markmodmover")
# acf is pretty reduced, but you can still see some
# we also have AIC with markmodmover (it's stored in an S4 slot, so access it with @)
mod_ac@AIC
Key Points
Choosing Between Numbers of States
Overview
Teaching: 30 min
Exercises: 0 minQuestions
Objectives
###################################################
### lesson 3: picking between numbers of states ###
###################################################
# so we don't have a great model fit
# we also already thought that there might be a third behaviour in there
# so now let's try adding more states
# check from states 3:6
mods_ac <- list()
mods_ac[["nstates2"]] <- mod_ac
# fit a bunch of models
for(i in 3:6) mods_ac[[paste("nstates", i, sep="")]] <- fit(pike, N.States=i)
# so we get a couple warnings about our convergence, let's check out which models had problems
lapply(mods_ac, function(x)x@Convergence)
mods_ac[["nstates6"]]<- NULL # can get rid of the ones with bad convergence like so
mods_ac[["nstates5"]]<- NULL
# codes 3, 4, and 5 are okay, anything else isn't
# to choose between we can look at the AIC
lapply(mods_ac, function(x)x@AIC)
# notice that the AIC in this case is almost always going down. there is a good amount of research
# suggesting that when you use AIC to select amongst numbers of behavioural states, it often
# favours the model with more states. so while we can use this as some information when
# picking how many states we need, it shouldn't be the only information. we should
# also be looking at the overall model fit (residuals), and thinking about biological relevance
# we can take a look at the deltaAIC
lapply(mods_ac, function(x)x@AIC) %>% unlist() %>% diff()
# 2d plots of step length autocorrelation
par(mfrow=c(2,2))
# h (the bandwidth, i.e. the level of smoothness) can really affect our perception of these plots
for(i in 1:3){
resdens <- MASS::kde2d(head(mods_ac[[i]]@Residuals$Step.Length,-1),
tail(mods_ac[[i]]@Residuals$Step.Length,-1),
lims = c(-1,1,-1,1), h=0.2)
image(resdens, col=grey(seq(1,0,length.out = 10)), main = names(mods_ac)[[i]],
xlab = 'step length residual at time t-1', ylab = 'step length residual at time t')
}
# should look like random scatter
# gets better with more states
# step length qqplots
par(mfrow=c(2,2))
for(i in 1:3){
qqplot(y = residuals(mods_ac[[i]])$Step.Length,
x = seq(-1,1,2/nrow(residuals(mods_ac[[i]]))),
xlab = "theoretical", ylab='predicted', main = names(mods_ac)[[i]])
abline(a = 0, b = 1, col = "red")
}
# should follow 1-1 line
# turning angle qqplots
par(mfrow=c(2,2))
for(i in 1:3){
qqplot(y = residuals(mods_ac[[i]])$Deflection.Angle,
x = seq(-1,1,2/nrow(residuals(mods_ac[[i]]))),
xlab = "theoretical", ylab='predicted', main = names(mods_ac)[[i]])
abline(a = 0, b = 1, col = "red")
}
# should follow 1-1 line
# that little divot in the middle is weird, maybe we need some zero inflation?
par(mfrow=c(2,2))
for(i in 1:3) acf(residuals(mods_ac[[i]])$Step.Length, main = names(mods_ac)[[i]])
# i would say the fit improves going from 2 to 3 states, and maybe improves a bit more
# going from 3 to 4 states
# so now we take into account the parameter estimates and the biology of the animal
plot(mods_ac[['nstates3']])
mods_ac[['nstates3']]
states_ac <- viterbiPath(mods_ac[['nstates3']])
data.frame(val = rle(states_ac)$values, n = rle(states_ac)$lengths) %>%
ggplot(aes(val %>% factor, n)) + geom_violin()
### state 1
# one turning angle with low scale centered at pi
# paired with low reversion level, and low acf
# suggests residency/low movement behaviour
### state 2
# one turning angle with medium scale centered at 0
# paired with higher reversion level and higher acf
# lower level of movement, maybe they're slowing down and looking for a good hideout?
### state 3
# one turning angle with high scale centered at 0
# paired with higher step lengths and high acf
# highly directed movement
### tpm
# good amount of time spent in each state, but highly unlikely to switch between
# states 1 and 3
# suggests to me that state 1 is the sit-and-wait behaviour, state 3 is directed travel
# between places, and state 2 is exploratory/transitioning behaviour between the other two states
# always good to be able to validate these behaviours (e.g. cameras)
### states
# stay in state 1 a lot longer than others, also suggests sitting and waiting
# i think unlikely to get the actual burst behaviour
# (too quick and won't travel far enough and won't stay in it long enough)
plot(mods_ac[['nstates4']])
mods_ac[['nstates4']]
# state 1: centered at 0 with low scale, plus very low step length reversion level and no acf
# state 2: centered at 0 with medium scale, with a slightly larger reversion level and low acf
# state 3: centered at pi, medium scale, similar reversion level to state 2 and a bit of acf
# state 4: centered at 0, high scale, large reversion level and acf
# tpm getting harder to interpret
states_ac <- viterbiPath(mods_ac[['nstates4']])
data.frame(val = rle(states_ac)$values, n = rle(states_ac)$lengths) %>%
ggplot(aes(val %>% factor, n)) + geom_violin()
par(mfcol=c(2,1))
plot(mods_ac[['nstates3']], y='locations')
plot(mods_ac[['nstates4']], y='locations')
# the bottom line is that you need to be able to explain the behaviours, so since i
# am at the limit of my pike knowledge i would pick three states
Key Points