Background
Overview
Teaching: 5 min
Exercises: 0 min
Questions
What is the Ocean Tracking Network?
How does your local telemetry network interact with OTN?
What methods of data analysis will be covered?
Objectives
Intro to OTN
The Ocean Tracking Network (OTN) supports global telemetry research by providing training, equipment, and data infrastructure to our large network of partners.
OTN and affiliated networks provide automated cross-referencing of your detection data with other tags in the system to help resolve “mystery detections” and provide detection data to taggers in other regions. OTN’s Data Managers will also extensively quality-control your submitted metadata for errors to ensure the most accurate records possible are stored in the database. OTN’s database and Data Portal website are excellent places to archive your datasets for future use and sharing with collaborators. We offer pathways to publish your datasets with OBIS, and via open data portals like ERDDAP and GeoServer. The data-product format returned by OTN is directly ingestible by analysis packages such as glatos and resonATe for ease of analysis. OTN offers support for the use of these packages and tools.
Learn more about OTN and our partners here: https://members.oceantrack.org/. Please contact OTNDC@DAL.CA if you are interested in connecting with your regional network and learning about their affiliation with OTN.
Intended Audience
This set of workshop material is directed at researchers who are ready to begin the work of acoustic telemetry data analysis. The first few lessons will begin with introductory R - no previous coding experience required. The workshop material progresses into more advanced techniques as we move along, beginning around lesson 8, “Introduction to Glatos”.
If you’d like to refresh your R coding skills outside of this workshop curriculum, we recommend Data Analysis and Visualization in R for Ecologists as a good starting point. Much of this content is included in the first two lessons of this workshop.
Getting Started
Please follow the instructions in the “Setup” tab along the top menu to install all required software, packages and data files. If you have questions or are running into errors, please reach out to OTNDC@DAL.CA for support.
Intro to Telemetry Data Analysis
OTN-affiliated telemetry networks all provide researchers with pre-formatted datasets, which are easily ingested by the data analysis tools covered in this workshop.
Before diving into analysis, it is important to take the time to clean and sort your dataset: taking the pre-formatted files and combining them in different ways allows you to analyse the data with different questions in mind.
Key Points
Intro to R
Overview
Teaching: 30 min
Exercises: 20 min
Questions
What are common operators in R?
What are common data types in R?
What are some base R functions?
How do I deal with missing data?
Objectives
First, let’s learn about RStudio.
RStudio is divided into 4 “Panes”: the Source for your scripts and documents (top-left, in the default layout); your Environment/History (top-right) which shows all the objects in your working space (Environment) and your command history (History); your Files/Plots/Packages/Help/Viewer (bottom-right); and the R Console (bottom-left). The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).
The R Script in the top pane can be saved and edited, while code typed directly into the Console below will disappear after closing the R session.
R can access files on and save outputs to any folder on your computer. R knows where to look for and save files based on the current working directory. This is one of the first things you should set up: a folder you’d like to contain all your data, scripts and outputs. The working directory path will be different for everyone. For the workshop, we’ve included the path one of our instructors uses, but you should use your computer’s file explorer to find the correct path for your data.
Setting up R
# Packages ####
# once you install packages to your computer, you can "check them out" of your packages library each time you need them
# make sure you check the "mask" messages that appear - sometimes packages have functions with the same names!
library(tidyverse)# really neat collection of packages! https://www.tidyverse.org/
library(lubridate)
library(readxl)
library(viridis)
library(plotly)
library(ggmap)
# Working Directory ####
#Instructors!! since this lesson is mostly base R we're not going to make four copies of it as with the other nodes.
#Change this one line as befits your audience.
setwd('YOUR/PATH/TO/data/NODE') #set folder you're going to work in
getwd() #check working directory
#you can also change it in the RStudio interface by navigating in the file browser where your working directory should be
#(if you can't see the folder you want, choose the three horizontal dots on the right side of the Home bar),
#and clicking on the blue gear icon "More", and select "Set As Working Directory".
Before we begin the lesson proper, a note on finding additional help. R Libraries, like those included above, are broad and contain many functions. Though most include documentation that can help if you know what to look for, sometimes more general help is necessary. To that end, RStudio maintains cheatsheets for several of the most popular libraries, which can be found here: https://www.rstudio.com/resources/cheatsheets/. As a start, the page includes an RStudio IDE cheatsheet that you may find useful while learning to navigate your workspace. With that in mind, let’s start learning R.
Intro to R
Like most programming languages, we can do basic mathematical operations with R. These, along with variable assignment, form the basis of everything for which we will use R.
Operators
Operators in R include standard mathematical operators (+, -, *, /) as well as an assignment operator, <- (a less-than sign followed by a hyphen). The assignment operator is used to associate a value with a variable name (or, to ‘assign’ the value a name). This lets us refer to that value later, by the name we’ve given to it. This may look unfamiliar, but it fulfils the same function as the ‘=’ operator in most other languages.
3 + 5 #maths! including - , *, /
weight_kg <- 55 #assignment operator! for objects/variables. shortcut: alt + -
weight_kg
weight_lb <- 2.2 * weight_kg #can assign output to an object. can use objects to do calculations
Variables Challenge
If we change the value of weight_kg to be 100, does the value of weight_lb also change? Remember: You can check the contents of an object by typing out its name and running the line in RStudio.
Solution
No! You have to re-assign 2.2*weight_kg to the object weight_lb for it to update.
The order in which you run your operations is very important; if you change something, you may need to re-run everything!
weight_kg <- 100
weight_lb #didn't change!
weight_lb <- 2.2 * weight_kg #now it's updated
Functions
While we can write code as we have in the section above - line by line, executed one line at a time - it is often more efficient to run multiple lines of code at once. By using functions, we can even compress complex calculations into just one line!
Functions use a single name to refer to underlying blocks of code that execute a specific calculation. To run a function you need two things: the name of the function, which is usually indicative of the function’s purpose; and the function’s arguments- the variables or values on which the function should execute.
#functions take "arguments": you have to tell them what to run their script against
ten <- sqrt(weight_kg) #functions wrap calculations into a single command you can type.
#Output of the function can be assigned directly to a variable...
round(3.14159) #... but doesn't have to be.
Since there are hundreds of functions and often their functionality can be nuanced, we have several ways to get more information on a given function. First, we can use ‘args()’, itself a function that takes the name of another function as an argument, which will tell us the required arguments of the function against which we run it.
Second, we can use the ‘?’ operator. Typing a question mark followed by the name of a function will open a Help window in RStudio’s bottom-right panel. This will contain the most complete documentation available for the function in question.
args(round) #the args() function will show you the required arguments of another function
?round #will show you the full help page for a function, so you can see what it does
Functions Challenge
Can you round the value 3.14159 to two decimal places?
Hint: Using args() on a function can give you a clue.
Solution
round(3.14159, 2) #the round function's second argument is the number of digits you want in the result
round(3.14159, digits = 2) #same as above
round(digits = 2, x = 3.14159) #when reordering arguments, you need to name them
Vectors and Data Types
While variables can hold a single value, sometimes we want to store multiple values in the same variable name. For this, we can use an R data structure called a ‘vector.’ Vectors contain one or more variables of the same data type, and can be assigned to a single variable name, as below.
weight_g <- c(21, 34, 39, 54, 55) #use the combine function to join values into a vector object
length(weight_g) #explore vector
class(weight_g) #a vector can only contain one data type
str(weight_g) #find the structure of your object.
Above, we mentioned ‘data type’. This refers to the kind of data represented by a value, or stored by the appropriate variable. Data types include character (words or letters), logical (boolean TRUE or FALSE values), or numeric data. Crucially, vectors can only contain one type of data, and will force all data in the vector to conform to that type (i.e., data in the vector will all be treated as the same data type, regardless of whether or not it was of that type when the vector was created). We can always check the data type of a variable or vector by using the ‘class()’ function, which takes the variable name as an argument.
#our first vector is numeric.
#other options include: character (words), logical (TRUE or FALSE), integer etc.
animals <- c("mouse", "rat", "dog") #to create a character vector, use quotes
class(weight_g)
class(animals)
# Note:
#R will convert (force) all values in a vector to the same data type.
#for this reason: try to keep one data type in each vector
#a data table / data frame is just multiple vectors (columns)
#this is helpful to remember when setting up your field sheets!
Vectors Challenge
What data type will this vector become?
challenge3 <- c(1, 2, 3, "4")
Hint: You can check a vector’s type with the class() function.
Solution
R will force all of these to be characters, since the number 4 has quotes around it! R will always coerce data types following this hierarchy: logical → numeric → character.
class(challenge3)
Indexing and Subsetting
We can use subsetting to select only a portion of a vector. For this, we use square brackets after the name of a vector. If we supply a single numeric value, as below, we will retrieve only the value from that index of the vector. Note: vectors in R are indexed with 1 representing the first index- other languages use 0 for the start of their array, so if you are coming from a language like Python, this can be disorienting.
animals #calling your object will print it out
animals[2] #square brackets = indexing. selects the 2nd value in your vector
We can select a specific value, as above, but we can also select one or more entries based on conditions. By supplying one or more criteria to our indexing syntax, we can retrieve the elements of the vector that match those criteria.
weight_g > 50 #conditional indexing: selects based on criteria
weight_g[weight_g <=30 | weight_g == 55] #many new operators here!
#<= less than or equal to; | "or"; == equal to. Also available are >=, greater than or equal to; < and > for less than or greater than (no equals); and & for "and".
weight_g[weight_g >= 30 & weight_g == 21] # >= greater than or equal to, & "and"
# this particular example gives 0 results - why?
Missing Data
In practical data analysis, our data is often incomplete. It is therefore useful to cover some methods of dealing with NA values. NA is R’s shorthand for a null value, or a value where there is no data. Certain functions cannot process NA data, and therefore provide arguments that allow NA values to be removed before the function executes.
heights <- c(2, 4, 4, NA, 6)
mean(heights) #some functions can't handle NAs
mean(heights, na.rm = TRUE) #remove the NAs before calculating
This can be done within an individual function as above, but for our entire analysis we may want to produce a copy of our dataset without the NA values included. Below, we’ll explore a few ways to do that.
heights[!is.na(heights)] #select only the values where it's NOT NA
#[] square brackets are the base R way to select a subset of data --> called indexing
#! is the negation operator - it reverses the TRUE/FALSE values returned by is.na()
na.omit(heights) #omit the NAs
heights[complete.cases(heights)] #select only complete cases
Missing Data Challenge
Question 1: Using the following vector of heights in inches, create a new vector, called heights_no_na, with the NAs removed.
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
Solution
heights_no_na <- heights[!is.na(heights)]
# or
heights_no_na <- na.omit(heights)
# or
heights_no_na <- heights[complete.cases(heights)]
Question 2: Use the function median() to calculate the median of the heights vector.
Solution
median(heights, na.rm = TRUE)
Bonus question: Use R to figure out how many people in the set are taller than 67 inches.
Solution
heights_above_67 <- heights_no_na[heights_no_na > 67]
length(heights_above_67)
Key Points
Starting with Data Frames
Overview
Teaching: 25 min
Exercises: 10 min
Questions
How do I import tabular data?
How do I explore my data set?
What are some basic data manipulation functions?
Objectives
Dataframes and dplyr
In this lesson, we’re going to introduce a package called dplyr. dplyr takes advantage of an operator called a pipe to create chains of data manipulation that produce powerful exploratory summaries. It also provides a suite of further functionality for manipulating dataframes: tabular sets of data that are common in data analysis. If you’ve imported the tidyverse library, as we did during setup and in the last episode, then congratulations: you already have dplyr (along with a host of other useful packages). As an aside, the cheat sheets for dplyr and readr may be useful when reviewing this lesson.
You may not be familiar with dataframes by name, but you may recognize the structure. Dataframes are arranged into rows and columns, not unlike tables in typical spreadsheet format (ex: Excel). In R, they are represented as vectors of vectors: that is, a vector wherein each column is itself a vector. If you are familiar with matrices, or two-dimensional arrays in other languages, the structure of a dataframe will be clear to you.
However, dataframes are not merely vectors - they are a specific type of object with their own functionality, which we will cover in this lesson.
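To make the ‘vectors of vectors’ idea concrete, here is a minimal sketch that builds a small dataframe by hand from two vectors (the values are illustrative):
fish <- data.frame(animal = c("mouse", "rat", "dog"), #a character vector becomes one column
                   weight_g = c(21, 34, 39)) #a numeric vector becomes another
str(fish) #two columns (vectors) of equal length, each with its own data type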
We are going to use GLATOS-style detection extracts for this lesson.
Importing from CSVs
Before we can start analyzing our data, we need to import it into R. Fortunately, we have a function for this. read_csv is a function from the readr package, also included with the tidyverse library. This function can read data from a .csv file into a dataframe. “.csv” is an extension that denotes a Comma-Separated Value file, or a file wherein data is arranged into rows, and entries within rows are delimited by commas. They’re common in data analysis.
For the purposes of this lesson, we will only cover read_csv; however, there is another function, read_excel, which you can use to import Excel files. It’s from a different library (readxl) and is outside the scope of this lesson, but worth investigating if you need it.
To import your data from your CSV file, we just need to pass the file path to read_csv, and assign the output to a variable. Note that the file path you give to read_csv will be relative to the working directory you set in the last lesson, so keep that in mind.
#imports file into R. paste the filepath to the unzipped file here!
lamprey_dets <- read_csv("inst_extdata_lamprey_detections.csv", guess_max = 3103)
You may have noticed that our call to read_csv has a second argument: guess_max. This is a useful argument when some of our columns begin with a lot of NULL values. When determining what data type to assign to a column, rather than checking every single entry, R will check the first few and make a guess based on that. If the first few values are null, R will get confused and throw an error when it actually finds data further down in the column. guess_max lets us tell R exactly how many rows to read before trying to make a guess. This way, we know it will read enough entries in each column to actually find data, which it will prioritize over the NULL values when assigning a type to the column. This parameter isn’t always necessary, but it can be vital depending on your dataset.
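If guessing still fails, readr also lets you declare a troublesome column’s type up front with the col_types argument, skipping the guessing entirely. A sketch, assuming release_latitude is the column causing trouble:
lamprey_dets <- read_csv("inst_extdata_lamprey_detections.csv",
                         col_types = cols(release_latitude = col_double())) #unlisted columns are still guessed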
We can now refer to the variable lamprey_dets to access, manipulate, and view the data from our CSV. In the next sections, we will explore some of the basic operations you can perform on dataframes.
Exploring Detection Extracts
Let’s start with a practical example. What can we find out about these matched detections? We’ll begin by running the code below, and then give some insight into what each function does. Remember, if you’re ever confused about the purpose of a function, you can use ‘?’ followed by the function name (e.g., ?head or ?View) to get more information.
head(lamprey_dets) #first 6 rows
View(lamprey_dets) #can also click on object in Environment window
str(lamprey_dets) #can see the type of each column (vector)
glimpse(lamprey_dets) #similar to str()
#summary() is a base R function that will spit out some quick stats about a vector (column)
#the $ syntax is the way base R selects columns from a data frame
summary(lamprey_dets$release_latitude)
You may now have an idea of what each of those functions does, but we will briefly explain each here.
head takes the dataframe as a parameter and returns the first 6 rows of the dataframe. This is useful if you want to quickly check that a dataframe imported correctly, or that all the columns are there, or see other such at-a-glance information. Its primary utility is that it is much faster to load and review than the entire dataframe, which may be several tens of thousands of rows long. Note that the related function tail will return the last six rows.
If we do want to load the entire dataframe, though, we can use View, which will open the dataframe in its own panel, where we can scroll through it as though it were an Excel file. This is useful for seeking out specific information without having to consult the file itself. Note that for large dataframes, this function can take a long time to execute.
Next are the functions str and glimpse, which do similar but different things. str is short for ‘structure’ and will print out a lot of information about your dataframe, including the number of rows and columns (called ‘observations’ and ‘variables’), the column names, the first four entries of each column, and each column’s type as well. str can sometimes be a bit overwhelming, so if you want to see a more condensed output, glimpse can be useful. It prints less information, but is cleaner and more compact, which can be desirable.
Finally, we have the summary function, which takes a single column from a dataframe and produces a summary of its basic statistics. You may have noticed the ‘$’ in the summary call - this is how we index a specific column from a dataframe. In this case, we are referring to the release_latitude column of our dataframe.
Using what you now know about summary functions, try to answer the challenge below.
Detection Extracts Challenge
Question 1: What is the class of the station column in lamprey_dets, and how many rows and columns are in the lamprey_dets dataset?
Solution
The column is a character, and there are 5,923 rows with 30 columns
str(lamprey_dets) # or glimpse(lamprey_dets)
Data Manipulation
Now that we’ve learned how to import and summarize our data, we can learn how to use dplyr to manipulate it. The name ‘dplyr’ may seem esoteric - the ‘d’ is short for ‘dataframe’, and ‘plyr’ is meant to evoke pliers, and thereby cutting, twisting, and shaping. This is an elegant summation of the dplyr library’s functionality.
We are going to introduce a new operator in this section, called the “dplyr pipe”. Not to be confused with |, which is also called a pipe in some other languages, the dplyr pipe is rendered as %>%. A pipe takes the output of the function or contents of the variable on the left and passes it to the function on the right. It is often read as “and then.” If you want to quickly add a pipe, the keyboard shortcut CTRL + SHIFT + M will do so.
library(dplyr) #can use tidyverse package dplyr to do exploration on dataframes in a nicer way
# %>% is a "pipe" which allows you to join functions together in sequence.
lamprey_dets %>% dplyr::select(6) #selects column 6
# Using the above transliteration: "take lamprey_dets AND THEN select column number 6 from it using the select function in the dplyr library"
You may have noticed another unfamiliar operator above, the double colon (::). This is used to specify the package from which we want to pull a function. Until now, we haven’t needed this, but as your code grows and the number of libraries you’re using increases, it’s likely that multiple functions across several different packages will have the same name (a phenomenon called “masking”). R has no automatic way of knowing which package contains the function you are referring to, so using double colons lets us specify it explicitly. It’s important to be able to do this, since different functions with the same name often do markedly different things.
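For example, dplyr’s filter shares its name with the filter function in base R’s stats package, which is attached in every session, and whichever package loads last masks the other. A minimal sketch of disambiguating the two with double colons (the moving-average call is purely illustrative):
stats::filter(1:10, rep(1/3, 3)) #base R's filter: linear filtering of a time series
lamprey_dets %>% dplyr::filter(animal_id == "A69-1601-1363") #dplyr's filter: subset rows of a dataframe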
Let’s explore a few other examples of how we can use dplyr and pipes to manipulate our dataframe.
lamprey_dets %>% slice(1:5) #selects rows 1 to 5 in the dplyr way
# Take lamprey_dets AND THEN slice rows 1 through 5.
#We can also use multiple pipes.
lamprey_dets %>%
distinct(glatos_array) %>% nrow #number of arrays that detected my fish in dplyr!
# Take lamprey_dets AND THEN select only the unique entries in the glatos_array column AND THEN count them with nrow.
#We can do the same as above with other columns too.
lamprey_dets %>%
distinct(animal_id) %>%
nrow #number of animals that were detected
# Take lamprey_dets AND THEN select only the unique entries in the animal_id column AND THEN count them with nrow.
#We can use filtering to conditionally select rows as well.
lamprey_dets %>% filter(animal_id=="A69-1601-1363")
# Take lamprey_dets AND THEN select only those rows where animal_id is equal to the above value.
lamprey_dets %>% filter(detection_timestamp_utc >= '2012-06-01 00:00:00') #all dets in/after June of 2012
# Take lamprey_dets AND THEN select only those rows where detection_timestamp_utc is on or after June 1, 2012.
These are all ways to extract a specific subset of our data, but dplyr can also be used to manipulate dataframes to give you even greater insights. We’re now going to use two new functions: group_by, which allows us to group our data by the values of a single column, and summarise (not to be confused with summary above!), which can be used to calculate summary statistics across your grouped variables, and produces a new dataframe containing these values as the output. These functions can be difficult to grasp, so don’t forget to use ?group_by and ?summarise if you get lost.
#get the mean value across a column using GroupBy and Summarize
lamprey_dets %>% #Take lamprey_dets, AND THEN...
group_by(animal_id) %>% #Group the data by animal_id- that is, create a group within the dataframe where each group contains all the rows related to a specific animal_id. AND THEN...
summarise(MeanLat=mean(deploy_lat)) #use summarise to add a new column containing the mean latitude of each group. We named this new column "MeanLat" but you could name it anything
With just a few lines of code, we’ve created a dataframe that contains each of our animal IDs and the mean latitude at which those fish were detected. dplyr, its wide array of functions, and the powerful pipe operator let us build detailed summaries like this one without writing too much code.
Data Manipulation Challenge
Question 1: Find the max lat and max longitude for animal “A69-1601-1363”.
Solution
lamprey_dets %>%
  filter(animal_id=="A69-1601-1363") %>%
  summarise(MaxLat=max(deploy_lat), MaxLong=max(deploy_long))
Question 2: Find the min lat/long of each animal for detections occurring in July 2012.
Solution
lamprey_dets %>%
  filter(detection_timestamp_utc >= "2012-07-01 00:00:00" & detection_timestamp_utc < "2012-08-01 00:00:00") %>%
  group_by(animal_id) %>%
  summarise(MinLat=min(deploy_lat), MinLong=min(deploy_long))
Joining Detection Extracts
We’re now going to briefly touch on a few useful dataframe use-cases that aren’t directly related to dplyr, but with which dplyr can help us.
One function that we’ll need to know is rbind, a base R function which lets us combine two R objects together. This is particularly useful if you have more than one detection extract provided by GLATOS (perhaps from multiple projects).
walleye_dets <- read_csv("inst_extdata_walleye_detections.csv", guess_max = 9595) #Import walleye detections
all_dets <- rbind(lamprey_dets, walleye_dets) #Now join the two dataframes
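As an aside, dplyr provides bind_rows as an alternative to rbind. It behaves much the same here, but is more forgiving when the two dataframes’ columns don’t line up exactly (missing columns are filled with NA rather than raising an error). A sketch, assuming the same two dataframes:
all_dets <- bind_rows(lamprey_dets, walleye_dets) #dplyr's equivalent of the rbind() call above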
Dealing with Datetimes
Datetime data is in a special format which is neither numeric nor character. It can be tricky to deal with, too, since Excel frequently reformats dates in any file it opens. We also have to concern ourselves with practical matters of time, like time zone and date formatting. Fortunately, the lubridate library gives us a whole host of functionality to manage datetime data. For additional help, the cheat sheet for lubridate may prove a useful resource.
We’ll also use a dplyr function called mutate, which lets us add new columns or change existing ones, while preserving the existing data in the table. Be careful not to confuse this with its sister function transmute, which adds or manipulates columns while dropping all other columns. If you’re ever in doubt as to which is which, remember: ?mutate and ?transmute will bring up the help files.
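To make that distinction concrete before we use mutate below, here is a minimal sketch contrasting the two functions on our detection extract (the new column name is illustrative):
lamprey_dets %>% mutate(lat_rounded = round(deploy_lat, 1)) #keeps all existing columns and adds one
lamprey_dets %>% transmute(lat_rounded = round(deploy_lat, 1)) #returns ONLY the new column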
library(lubridate)
lamprey_dets %>% mutate(detection_timestamp_utc=ymd_hms(detection_timestamp_utc)) #tells R to treat this column as a datetime, not as numbers
#as.POSIXct(lamprey_dets$detection_timestamp_utc) #this is the base R way - if you ever see this function
We’ve just used a single function, ymd_hms, but with it we’ve been able to completely reformat the entire detection_timestamp_utc column. ymd_hms is short for Year, Month, Day, Hours, Minutes, and Seconds. For example, at time of writing, it’s 2021-05-14 14:21:40. Other format functions exist too, like dmy_hms, which specifies the day first and year third (i.e., 14-05-2021 14:21:40). Investigate the documentation to find which is right for you.
There are too many useful lubridate functions to cover in the scope of this lesson. These include parse_date_time, which can be used to read in date data in multiple formats, which is useful if you have a column containing heterogeneous date data; as well as with_tz, which lets you make your data sensitive to timezones (including automatic daylight savings time awareness). Dates are a tricky subject, so be sure to investigate lubridate to make sure you find the functions you need.
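As a hedged sketch of those two functions (the dates, formats, and timezone below are illustrative, not from our dataset):
parse_date_time(c("2012-06-01 12:00:00", "01/07/2012 13:30"),
                orders = c("ymd HMS", "dmy HM")) #tries each format in turn on a messy column
with_tz(ymd_hms("2012-06-01 12:00:00", tz = "UTC"), tzone = "America/Toronto") #same instant, shown in a new timezone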
Key Points
Intro to Plotting
Overview
Teaching: 15 min
Exercises: 10 min
Questions
How do I plot my data?
How can I plot summaries of my data?
Objectives
Learn how to make basic plots with ggplot2
Learn how to combine dplyr summaries with ggplot2 plots
Background
Now that we have learned how to import, inspect, and manipulate our data, we are next going to learn how to visualize it. R provides a robust plotting suite in the library ggplot2. ggplot2 takes advantage of tidyverse pipes and chains of data manipulation to build plotting code. Additionally, it separates the aesthetics of the plot (what we are plotting) from the styling of the plot (what the plot looks like). What this means is that data aesthetics and styles can be built separately and then combined and recombined to produce modular, reusable plotting code. If ggplot2 seems daunting, the cheat sheet may prove useful.
While ggplot2 function calls can look daunting at first, they follow a single formula, detailed below.
#Anything within <> braces will be replaced in an actual function call.
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>
In the above example, there are three important parts: <DATA>, <MAPPINGS>, and <GEOM_FUNCTION>.
<DATA> refers to the data that we’ll be plotting. In general, this will be held in a dataframe like the one we prepared in the previous lessons.
<MAPPINGS> refers to the aesthetic mappings for the data - that is, which columns in the data will be used to determine which attributes of the graph. For example, if you have columns for latitude and longitude, you may want to map these onto the X and Y axes of the graph. We’ll cover how to do exactly that in a moment.
Finally, <GEOM_FUNCTION> refers to the style of the plot: what type of plot we are going to make. GEOM is short for “geometry”, and ggplot2 contains many different ‘geom’ functions that you can use. For this lesson, we’ll be using geom_point(), which produces a scatterplot, but in the future you may want to use geom_path(), geom_bar(), geom_boxplot(), or any of ggplot2’s other geom functions. Remember, since these are functions, you can use the help syntax (i.e., ?geom_point) in the R console to find out more about them and what you need to pass to them.
Now that we’ve introduced ggplot2, let’s build a functional example with our data.
# Begin by importing the ggplot2 library, which you should have installed as part of setup.
library(ggplot2)
# Build the plot and assign it to a variable.
lamprey_dets_plot <- ggplot(data = lamprey_dets,
mapping = aes(x = deploy_long, y = deploy_lat)) #can assign a base
With a couple of lines of code, we’ve already mostly completed a simple scatter plot of our data. The ‘data’ parameter takes our dataframe, and the mapping parameter takes the output of the aes() function, which itself takes a mapping of our data onto the axes of the graph. That can be a bit confusing, so let’s briefly break this down. aes() is short for ‘aesthetics’ - the function constructs the aesthetic mappings of our data, which describe how variables in the data are mapped to visual properties of the plot. For example, above, we are setting the ‘x’ attribute to ‘deploy_long’, and the ‘y’ attribute to ‘deploy_lat’. This means that the X axis of our plot will represent longitude, and the Y axis will represent latitude. Depending on the type of plot you’re making, you may want different values there, and different types of geom functions can require different aesthetic mappings (colour, for example, is another common one). You can always type ?aes() at the console if you want more information.
We still have one step to add to our plotting code: the geom function. We’ll be making a scatterplot, so we want to use geom_point().
lamprey_dets_plot +
geom_point(alpha=0.1,
colour = "blue")
#This will layer our chosen geom onto our plot template.
#alpha is a transparency argument in case points overlap. Try alpha = 0.02 to see how it works!
With just the above code, we’ve added our geom to our aesthetic and made our plot ready for display. We’ve built only a very simple plot here, but ggplot2 provides many, many options for building more complex, illustrative plots.
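For instance, building on the same plot object, we might add axis labels and a title - a small sketch of how extra layers stack on with +:
lamprey_dets_plot +
  geom_point(alpha = 0.1, colour = "blue") +
  xlab("Longitude") + ylab("Latitude") +
  ggtitle("Lamprey Detection Locations") #labels and titles are just more layers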
Basic plots
As a minor syntactic note, you can build your plots iteratively, without assigning them to a variable in between. For this, we make use of tidyverse pipes.
all_dets %>%
ggplot(aes(deploy_long, deploy_lat)) +
geom_point() #geom = the type of plot
all_dets %>%
ggplot(aes(deploy_long, deploy_lat, colour = common_name_e)) +
geom_point()
#anything you specify in the aes() is applied to the actual data points/whole plot,
#anything specified in geom() is applied to that layer only (colour, size...). sometimes you have >1 geom layer so this makes more sense!
You can see that all we need to do to make this work is omit the ‘data’ parameter, since that’s being passed in by the pipe. Note also that we’ve added colour = common_name_e to the second plot’s aesthetic, meaning that the output will be coloured based on the species of the animal.
Remembering which of the aes or the geom controls which variable can be difficult, but here’s a handy rule of thumb: anything specified in aes() will apply to the data points themselves, or the whole plot. They are broad statements about how the plot is to be displayed. Anything in the geom_ function will apply only to that geom_ layer. Keep this in mind, since it’s possible for your plot to have more than one geom_!
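Here is a brief sketch of that rule of thumb in action: one shared aes() mapping, and two geom_ layers styled independently (geom_density_2d is just one example of a second layer):
all_dets %>%
  ggplot(aes(deploy_long, deploy_lat)) + #the mapping in aes() applies to every layer
  geom_point(colour = "grey50", alpha = 0.2) + #styling here affects only the points layer
  geom_density_2d(colour = "red") #a second geom layer, styled separately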
Plotting and dplyr Challenge
Try combining dplyr functions with ggplot2 in this challenge! Try making a scatterplot showing the lat/long for animal “A69-1601-1363”, coloured by detection array.
Solution
all_dets %>%
  filter(animal_id=="A69-1601-1363") %>%
  ggplot(aes(deploy_long, deploy_lat, colour = glatos_array)) +
  geom_point()
What other geoms are there? Try typing geom_ into R to see what it suggests!
Key Points
You can feed output from dplyr’s data manipulation functions into ggplot using pipes.
Plotting various summaries and groupings of your data is good practice at the exploratory phase, and dplyr and ggplot make iterating different ideas straightforward.
Telemetry Reports - Imports
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What datasets do I need from the Network?
How do I import all the datasets?
Objectives
Importing all the datasets
Now that we have an idea of what an exploratory workflow might look like with Tidyverse libraries like dplyr and ggplot2, let’s look at how we might implement a common telemetry workflow using these tools.
For the GLATOS Network you will receive Detection Extracts which include all the Tag matches for your animals. These can be used to create many meaningful summary reports.
First, we will confirm we have our Tag Matches stored in a dataframe.
View(all_dets) #already have our tag matches
# if you do not have the variable created from a previous lesson, you can use the following code to re-create it:
#lamprey_dets <- read_csv("inst_extdata_lamprey_detections.csv", guess_max = 3103)
#walleye_dets <- read_csv("inst_extdata_walleye_detections.csv", guess_max = 9595)
# let's join these two detection files together!
#all_dets <- rbind(lamprey_dets, walleye_dets)
To give meaning to these detections we should import our GLATOS Workbook. These are in the standard GLATOS-style template which can be found here.
library(readxl)
# Deployment Metadata
walleye_deploy <- read_excel('inst_extdata_walleye_workbook.xlsm', sheet = 'Deployment') #pull in deploy sheet
View(walleye_deploy)
walleye_recovery <- read_excel('inst_extdata_walleye_workbook.xlsm', sheet = 'Recovery') #pull in recovery sheet
View(walleye_recovery)
#join the deploy and recovery sheets together
walleye_recovery <- walleye_recovery %>% rename(INS_SERIAL_NO = INS_SERIAL_NUMBER) #first, rename INS_SERIAL_NUMBER so they match between the two dataframes.
walleye_recievers <- merge(walleye_deploy, walleye_recovery,
by.x = c("GLATOS_PROJECT", "GLATOS_ARRAY", "STATION_NO",
"CONSECUTIVE_DEPLOY_NO", "INS_SERIAL_NO"),
by.y = c("GLATOS_PROJECT", "GLATOS_ARRAY", "STATION_NO",
"CONSECUTIVE_DEPLOY_NO", "INS_SERIAL_NO"),
all.x=TRUE, all.y=TRUE) #keep all the info from each, merged using the above columns
View(walleye_recievers)
# Tagging metadata
walleye_tag <- read_excel('inst_extdata_walleye_workbook.xlsm', sheet = 'Tagging')
View(walleye_tag)
#remember: we learned how to switch timezone of datetime columns above,
# if that is something you need to do with your dataset!!
#hint: check the GLATOS_TIMEZONE column to see if it's what you want!
The glatos R package (which will be introduced in future lessons) can import your Workbook in one step! The function will format all datetimes to UTC, check for conflicts, join the deploy/recovery tabs, etc. This package is beyond the scope of this lesson, but is incredibly useful for GLATOS Network members. Below is some example code:
# this won't work unless you happen to have this installed - just a teaser today, will be covered tomorrow
library(glatos)
data <- read_glatos_workbook('inst_extdata_walleye_workbook.xlsm')
receivers <- data$receivers
animals <- data$animals
Finally, we can import the station locations for the entire GLATOS Network, to help give context to our detections, which may have occurred on partner arrays.
glatos_receivers <- read_csv("inst_extdata_sample_receivers.csv")
View(glatos_receivers)
Key Points
Telemetry Reports for Array Operators
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How do I summarize and plot my deployments?
How do I summarize and plot my detections?
Objectives
Mapping GLATOS stations - Static map
This section will use a set of receiver metadata from the GLATOS Network, showing stations which may not be included in our Project. We will make a static map of all the receiver stations in three steps, using the package ggmap.
First, we set a basemap using the aesthetics and bounding box we desire. Then, we will filter our stations dataset for those which we would like to plot on the map. Next, we add the stations onto the basemap and look at our creation! If we are happy with the product, we can export the map as a .tiff file using the ggsave function, to use outside of R. Other possible export formats include .png, .jpeg, .pdf, and more.
library(ggmap)
#first, what are our columns called?
names(glatos_receivers)
#make a basemap for all of the stations, using the min/max deploy lat and longs as bounding box
base <- get_stamenmap(
bbox = c(left = min(glatos_receivers$deploy_long),
bottom = min(glatos_receivers$deploy_lat),
right = max(glatos_receivers$deploy_long),
top = max(glatos_receivers$deploy_lat)),
maptype = "terrain-background",
crop = FALSE,
zoom = 8)
#filter for stations you want to plot - this is very customizable
glatos_deploy_plot <- glatos_receivers %>%
dplyr::mutate(deploy_date=ymd_hms(deploy_date_time)) %>% #make a datetime
dplyr::mutate(recover_date=ymd_hms(recover_date_time)) %>% #make a datetime
dplyr::filter(!is.na(deploy_date)) %>% #no null deploys
dplyr::filter(deploy_date > '2011-07-03' & recover_date < '2018-12-11') %>% #only looking at certain deployments, can add start/end dates here
dplyr::group_by(station, glatos_array) %>%
dplyr::summarise(MeanLat=mean(deploy_lat), MeanLong=mean(deploy_long)) #get the mean location per station, in case there is >1 deployment
# you could choose to plot stations which are within a certain bounding box!
#to do this you would add another filter to the above data, before passing to the map
# ex: add a line like this after the mutate() clauses (the bounds here are illustrative):
# filter(deploy_lat >= 43 & deploy_lat <= 46 & deploy_long >= -85 & deploy_long <= -82)
#add your stations onto your basemap
glatos_map <-
ggmap(base, extent='panel') +
ylab("Latitude") +
xlab("Longitude") +
geom_point(data = glatos_deploy_plot, #add the stations we filtered above
aes(x = MeanLong,y = MeanLat, colour = glatos_array), #specify the data
shape = 19, size = 2) #lots of aesthetic options here!
#view your receiver map!
glatos_map
#save your receiver map into your working directory
ggsave(plot = glatos_map, filename = "glatos_map.tiff", units="in", width=15, height=8)
#can specify location, file type and dimensions
Mapping our stations - Static map
We can do the same exact thing with the deployment metadata from OUR project only! This will use metadata imported from our Workbook.
base <- get_stamenmap(
bbox = c(left = min(walleye_recievers$DEPLOY_LONG),
bottom = min(walleye_recievers$DEPLOY_LAT),
right = max(walleye_recievers$DEPLOY_LONG),
top = max(walleye_recievers$DEPLOY_LAT)),
maptype = "terrain-background",
crop = FALSE,
zoom = 8)
#filter for stations you want to plot - this is very customizable
walleye_deploy_plot <- walleye_recievers %>%
dplyr::mutate(deploy_date=ymd_hms(GLATOS_DEPLOY_DATE_TIME)) %>% #make a datetime
dplyr::mutate(recover_date=ymd_hms(GLATOS_RECOVER_DATE_TIME)) %>% #make a datetime
dplyr::filter(!is.na(deploy_date)) %>% #no null deploys
dplyr::filter(deploy_date > '2011-07-03' & is.na(recover_date)) %>% #only looking at certain deployments, can add start/end dates here
dplyr::group_by(STATION_NO, GLATOS_ARRAY) %>%
dplyr::summarise(MeanLat=mean(DEPLOY_LAT), MeanLong=mean(DEPLOY_LONG)) #get the mean location per station, in case there is >1 deployment
#add your stations onto your basemap
walleye_deploy_map <-
ggmap(base, extent='panel') +
ylab("Latitude") +
xlab("Longitude") +
geom_point(data = walleye_deploy_plot, #add the stations we filtered above
aes(x = MeanLong,y = MeanLat, colour = GLATOS_ARRAY), #specify the data
shape = 19, size = 2) #lots of aesthetic options here!
#view your receiver map!
walleye_deploy_map
#save your receiver map into your working directory
ggsave(plot = walleye_deploy_map, filename = "walleye_deploy_map.tiff", units="in", width=15, height=8)
#can specify location, file type and dimensions
Mapping all GLATOS Stations - Interactive map
An interactive map can contain more information than a static map. Here we will explore the package plotly to create interactive “slippy” maps. These allow you to explore your map in different ways by clicking and scrolling through the output.
First, we will set our basemap’s aesthetics and bounding box and assign this information (as a list) to a geo_styling variable.
library(plotly)
#set your basemap
geo_styling <- list(
fitbounds = "locations", visible = TRUE, #fits the bounds to your data!
showland = TRUE,
showlakes = TRUE,
lakecolor = toRGB("blue", alpha = 0.2), #make it transparent
showcountries = TRUE,
landcolor = toRGB("gray95"),
countrycolor = toRGB("gray85")
)
Then, we choose which Deployment Metadata dataset we wish to use and identify the columns containing Latitude and Longitude, using the plot_geo function.
#decide what data you're going to use. We have chosen glatos_deploy_plot which we created earlier.
glatos_map_plotly <- plot_geo(glatos_deploy_plot, lat = ~MeanLat, lon = ~MeanLong)
Next, we use the add_markers function to write out what information we would like to have displayed when we hover our mouse over a station in our interactive map. In this case, we chose to use paste to join together the Station Name and its lat/long.
#add your markers for the interactive map
glatos_map_plotly <- glatos_map_plotly %>% add_markers(
text = ~paste(station, MeanLat, MeanLong, sep = "<br />"),
symbol = I("square"), size = I(8), hoverinfo = "text"
)
Finally, we add all this information together, along with a title, using the layout function, and now we can explore our interactive map!
#Add layout (title + geo styling)
glatos_map_plotly <- glatos_map_plotly %>% layout(
title = 'GLATOS Deployments<br />(> 2011-07-03)', geo = geo_styling
)
#View map
glatos_map_plotly
To save this interactive map as an .html file, you can explore the function htmlwidgets::saveWidget(), which is beyond the scope of this lesson.
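If you do want to try it, the call itself is short; a sketch, with an illustrative filename (plotly maps are htmlwidgets, so saveWidget can write them out directly):
htmlwidgets::saveWidget(glatos_map_plotly, "glatos_map.html") #saves a standalone interactive page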
How are my stations performing?
Let’s find out more about the animals detected by our array! These summary statistics, created using dplyr functions, could be used to help determine how successful each of your stations has been at detecting your tagged animals. We will also learn how to export our results using write_csv.
#How many detections of my tags does each station have?
library(dplyr)
det_summary <- all_dets %>%
filter(glatos_project_receiver == 'HECST') %>% #choose to summarize by array, project etc!
mutate(detection_timestamp_utc=ymd_hms(detection_timestamp_utc)) %>%
group_by(station, year = year(detection_timestamp_utc), month = month(detection_timestamp_utc)) %>%
summarize(count =n())
det_summary #number of dets per month/year per station
#How many detections of my tags does each station have? Per species
anim_summary <- all_dets %>%
filter(glatos_project_receiver == 'HECST') %>% #choose to summarize by array, project etc!
mutate(detection_timestamp_utc=ymd_hms(detection_timestamp_utc)) %>%
group_by(station, year = year(detection_timestamp_utc), month = month(detection_timestamp_utc), common_name_e) %>%
summarize(count =n())
anim_summary #number of dets per month/year per station & species
# Create a new data product, det_days, that gives the number of unique days on which animals were detected at each station
stationsum <- all_dets %>%
group_by(station) %>%
summarise(num_detections = length(animal_id),
start = min(detection_timestamp_utc),
end = max(detection_timestamp_utc),
uniqueIDs = length(unique(animal_id)),
det_days=length(unique(as.Date(detection_timestamp_utc))))
View(stationsum)
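As promised, any of these summary dataframes can be exported with write_csv from readr; a minimal sketch with an illustrative filename:
write_csv(det_summary, "det_summary.csv") #writes the summary to your working directory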
Key Points
Telemetry Reports for Tag Owners
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How do I summarize and plot my detections?
How do I summarize and plot my tag metadata?
Objectives
Mapping my Detections and Releases - static map
Where were my fish observed? We will make a static map of all the receiver stations where my fish were detected, in two steps, using the package ggmap.
First, we set a basemap using the aesthetics and bounding box we desire. Next, we add the detection locations onto the basemap and look at our creation!
base <- get_stamenmap(
bbox = c(left = min(all_dets$deploy_long),
bottom = min(all_dets$deploy_lat),
right = max(all_dets$deploy_long),
top = max(all_dets$deploy_lat)),
maptype = "terrain-background",
crop = FALSE,
zoom = 8)
#add your detections onto your basemap
detections_map <-
ggmap(base, extent='panel') +
ylab("Latitude") +
xlab("Longitude") +
geom_point(data = all_dets,
aes(x = deploy_long,y = deploy_lat, colour = common_name_e), #specify the data
shape = 19, size = 2) #lots of aesthetic options here!
#view your detections map!
detections_map
Mapping my Detections and Releases - interactive map
An interactive map can contain more information than a static map. Here we will explore the package plotly to create interactive “slippy” maps. These allow you to explore your map in different ways by clicking and scrolling through the output.
First, we will set our basemap’s aesthetics and bounding box and assign this information (as a list) to a geo_styling variable. Then, we choose which detections we wish to use and identify the columns containing Latitude and Longitude, using the plot_geo function. Next, we use the add_markers function to write out what information we would like to have displayed when we hover our mouse over a station in our interactive map. In this case, we chose to use paste to join together the Station Name and its lat/long. Finally, we add all this information together, along with a title, using the layout function, and now we can explore our interactive map!
#set your basemap
geo_styling <- list(
fitbounds = "locations", visible = TRUE, #fits the bounds to your data!
showland = TRUE,
showlakes = TRUE,
lakecolor = toRGB("blue", alpha = 0.2), #make it transparent
showcountries = TRUE,
landcolor = toRGB("gray95"),
countrycolor = toRGB("gray85")
)
#decide what data you're going to use
detections_map_plotly <- plot_geo(all_dets, lat = ~deploy_lat, lon = ~deploy_long)
#add your markers for the interactive map
detections_map_plotly <- detections_map_plotly %>% add_markers(
text = ~paste(animal_id, common_name_e, paste("Date detected:", detection_timestamp_utc),
paste("Latitude:", deploy_lat), paste("Longitude",deploy_long),
paste("Detected by:", glatos_array), paste("Station:", station),
paste("Project:",glatos_project_receiver), sep = "<br />"),
symbol = I("square"), size = I(8), hoverinfo = "text"
)
#Add layout (title + geo styling)
detections_map_plotly <- detections_map_plotly %>% layout(
title = 'Lamprey and Walleye Detections<br />(2012-2013)', geo = geo_styling
)
#View map
detections_map_plotly
Summary of tagged animals
This section will use your Tagging Metadata to create dplyr summaries of your tagged animals.
# summary of animals you've tagged
walleye_tag_summary <- walleye_tag %>%
mutate(GLATOS_RELEASE_DATE_TIME = ymd_hms(GLATOS_RELEASE_DATE_TIME)) %>%
#filter(GLATOS_RELEASE_DATE_TIME > '2012-06-01') %>% #select timeframe, specific animals etc.
group_by(year = year(GLATOS_RELEASE_DATE_TIME), COMMON_NAME_E) %>%
summarize(count = n(),
Meanlength = mean(LENGTH, na.rm=TRUE),
minlength= min(LENGTH, na.rm=TRUE),
maxlength = max(LENGTH, na.rm=TRUE),
MeanWeight = mean(WEIGHT, na.rm = TRUE))
#view our summary table
walleye_tag_summary
Detection Attributes
Let’s add some biological context to our summaries!
#Average location of each animal!
all_dets %>%
group_by(animal_id) %>%
summarize(NumberOfStations = n_distinct(station),
AvgLat = mean(deploy_lat),
AvgLong =mean(deploy_long))
# Avg length per location
all_dets_summary <- all_dets %>%
mutate(detection_timestamp_utc = ymd_hms(detection_timestamp_utc)) %>%
group_by(glatos_array, station, deploy_lat, deploy_long, common_name_e) %>%
summarise(AvgSize = mean(length, na.rm=TRUE))
all_dets_summary
#export our summary table as CSV
write_csv(all_dets_summary, "detections_summary.csv", col_names = TRUE)
# count detections per transmitter, per array
all_dets %>%
group_by(animal_id, glatos_array, common_name_e) %>%
summarize(count = n()) %>%
select(animal_id, common_name_e, glatos_array, count)
# list all GLATOS arrays each fish was seen on, and a number_of_arrays column too
arrays <- all_dets %>%
group_by(animal_id) %>%
mutate(arrays = (list(unique(glatos_array)))) %>% #create a column with a list of the arrays
dplyr::select(animal_id, arrays) %>% #remove excess columns
distinct_all() %>% #keep only one record of each
mutate(number_of_arrays = sapply(arrays,length)) %>% #sapply: applies a function across a List - in this case we are applying length()
as.data.frame()
View(arrays)
#Full summary of each animal's track
animal_id_summary <- all_dets %>%
group_by(animal_id) %>%
summarise(dets = length(animal_id),
stations = length(unique(station)),
min = min(detection_timestamp_utc),
max = max(detection_timestamp_utc),
tracklength = max(detection_timestamp_utc)-min(detection_timestamp_utc))
View(animal_id_summary)
Summary of Detection Counts
Let’s make an informative plot showing the number of matched detections per year and month.
all_dets %>%
mutate(detection_timestamp_utc=ymd_hms(detection_timestamp_utc)) %>% #make datetime
mutate(year_month = floor_date(detection_timestamp_utc, "months")) %>% #round to month
filter(common_name_e == 'walleye') %>% #can filter for specific stations, dates etc. doesn't have to be species!
group_by(year_month) %>% #can group by station, species et - doesn't have to be by date
summarize(count =n()) %>% #how many dets per year_month
ggplot(aes(x = (month(year_month) %>% as.factor()),
y = count,
fill = (year(year_month) %>% as.factor())
)
)+
geom_bar(stat = "identity", position = "dodge2")+
xlab("Month")+
ylab("Total Detection Count")+
ggtitle('Walleye Detections by Month (2012-2013)')+ #title
labs(fill = "Year") #legend title
Other Example Plots
Here are some examples of more complex plotting options. The most useful of these may include abacus plotting (an example with ‘animal’ and ‘station’ on the y-axis) as well as an example using ggmap and geom_path to create a map showing animal movement.
# an easy abacus plot!
#Use the color scales in this package to make plots that are pretty,
#better represent your data, easier to read by those with colorblindness, and print well in grey scale.
library(viridis)
abacus_animals <-
ggplot(data = all_dets, aes(x = detection_timestamp_utc, y = animal_id, col = glatos_array)) +
geom_point() +
ggtitle("Detections by animal") +
theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
scale_color_viridis(discrete = TRUE)
abacus_animals
#another way to visualize
abacus_stations <-
ggplot(data = all_dets, aes(x = detection_timestamp_utc, y = station, col = animal_id)) +
geom_point() +
ggtitle("Detections by station") +
theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
scale_color_viridis(discrete = TRUE)
abacus_stations
#track movement using geom_path!
movMap <-
ggmap(base, extent = 'panel') + #use the BASE we set up before
ylab("Latitude") +
xlab("Longitude") +
geom_path(data = all_dets, aes(x = deploy_long, y = deploy_lat, col = common_name_e)) + #connect the dots with lines
geom_point(data = all_dets, aes(x = deploy_long, y = deploy_lat, col = common_name_e)) + #layer the stations back on
scale_colour_manual(values = c("red", "blue"), name = "Species")+ #
facet_wrap(~animal_id, ncol = 6, nrow=1)+
ggtitle("Inferred Animal Paths")
movMap
# monthly latitudinal distribution of your animals (works best w >1 species)
all_dets %>%
group_by(month=month(detection_timestamp_utc), animal_id, common_name_e) %>% #make our groups
summarise(meanlat=mean(deploy_lat)) %>% #mean lat
ggplot(aes(month %>% factor, meanlat, colour=common_name_e, fill=common_name_e))+ #the data is supplied, but no info on how to show it!
geom_point(size=3, position="jitter")+ # draw data as points, and use jitter to help see all points instead of superimposition
#coord_flip()+ #flip x y, not needed here
scale_colour_manual(values = c("brown", "green"))+ #change the colour to represent the species better!
scale_fill_manual(values = c("brown", "green"))+ #colour of the boxplot
geom_boxplot()+ #another layer
geom_violin(colour="black") #and one more layer
# per-individual contours - lots of plots: called facets!
all_dets %>%
ggplot(aes(deploy_long, deploy_lat))+
facet_wrap(~animal_id)+ #make one plot per individual
geom_violin()
Key Points
BONUS - Introduction to glatos Data Processing Package
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How do I load my data into glatos?
How do I filter out false detections?
How can I consolidate my detections into detection events?
How do I summarize my data?
Objectives
The following bonus lessons require more installations than were provided in the setup for this workshop. If you would like to follow along, the steps can be found here under “Advanced Telemetry Workshop Requirements”.
The glatos package is a powerful toolkit that provides a wide range of functionality for loading, processing, and visualizing your data. With it, you can gain valuable insights with quick and easy commands that condense high volumes of base R into straightforward functions, with enough versatility to meet a variety of needs.
This package was originally created to meet the needs of the Great Lakes Acoustic Telemetry Observation System (GLATOS) and use their specific data formats. However, over time, the functionality has been expanded to allow operations on OTN-formatted data as well, broadening the range of possible applications for the software. As a point of clarification, “GLATOS” (all caps acronym) refers to the organization, while glatos refers to the package.
Our first step is setting our working directory and importing the relevant libraries.
## Set your working directory ####
setwd("YOUR/PATH/TO/data/glatos")
library(glatos)
library(tidyverse)
library(utils)
library(lubridate)
If you are following along with the workshop in the workshop repository, there should be a folder in ‘data/’ containing data corresponding to your node (at time of writing, FACT, ACT, GLATOS, or MigraMar). glatos can function with both GLATOS and OTN Node-formatted data, but the functions are different for each. Both, however, provide a marked performance boost over base R, and both ensure that the resulting data set will be compatible with the rest of the glatos framework.
We’ll start by importing one of the glatos package’s built-in datasets. glatos comes with a few datasets that are useful for testing code on a dataset known to work with the package’s functions. For this workshop, we’ll continue using the walleye data that we’ve been working with in previous lessons. First, we’ll use the system.file function to build the filepath to the walleye data. This saves us having to track down the file in the glatos package’s file structure - R can find it for us automatically.
# Get file path to example walleye data
det_file_name <- system.file("extdata", "walleye_detections.csv",
package = "glatos")
With our filename in hand, we’ll want to use the read_glatos_detections() function to load our data into a dataframe. In this case, our data is formatted in the GLATOS style; if it were OTN/ACT formatted, we would want to use read_otn_detections() instead.
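For reference, the OTN-style loader works the same way. A minimal sketch, assuming a hypothetical OTN-formatted detection extract saved at data/otn_detections.csv:
# hypothetical OTN-formatted detection extract - the path is an assumption
otn_dets <- read_otn_detections(det_file = "data/otn_detections.csv")
head(otn_dets, 2)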
Remember: you can always check a function’s documentation by typing a question mark, followed by the name of the function.
## GLATOS help files are helpful!!
?read_glatos_detections
# Save our detections file data into a dataframe called detections
detections <- read_glatos_detections(det_file=det_file_name)
Remember that we can use head() to inspect a few lines of our data to ensure it was loaded properly.
# View first 2 rows of output
head(detections, 2)
With our data loaded, we next want to apply a false-detection filtering algorithm to reduce the number of false detections in our dataset. glatos uses the Pincock algorithm to flag probable false detections based on the time lag between detections: tightly clustered detections are weighted as more likely to be true, while detections spaced far apart temporally are more likely to be marked as false. We can also pass the time-lag threshold as a variable to the false_detections() function. This lets us fine-tune our filtering to allow for greater or lesser temporal space between detections before they’re flagged as false.
## Filtering False Detections ####
## ?glatos::false_detections
#Write the filtered data to a new det_filtered object
#This doesn't delete any rows, it just adds a new column that tells you whether
#or not a detection was filtered out.
detections_filtered <- false_detections(detections, tf=3600, show_plot=TRUE)
head(detections_filtered)
nrow(detections_filtered)
The false_detections function will add a new column to your dataframe, ‘passed_filter’. This contains a boolean value that will tell you whether or not that record passed the false detection filter. That information may be useful on its own merits; for now, we will just use it to filter out the false detections.
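Before dropping anything, it can be worth counting how many detections passed. A minimal sketch using base R’s table():
# tally detections that passed (1) vs. failed (0) the filter
table(detections_filtered$passed_filter)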
# Filter based on the column if you're happy with it.
detections_filtered <- detections_filtered[detections_filtered$passed_filter == 1,]
nrow(detections_filtered) # Smaller than before
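Since tf is just a parameter, comparing thresholds before committing to one is straightforward. A minimal sketch re-running the filter on the unfiltered detections with a stricter 30-minute threshold; the 1800-second value is illustrative, not a recommendation:
# stricter threshold: flags more detections as potentially false
detections_strict <- false_detections(detections, tf = 1800)
sum(detections_strict$passed_filter == 1, na.rm = TRUE) # compare with the tf = 3600 result above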
With our data properly filtered, we can begin investigating it and developing some insights. glatos provides a range of tools for summarizing our data so that we can better see what our receivers are telling us.
We can begin with a summary by animal, which will group our data by the unique animals we’ve detected.
# Summarize Detections ####
#?summarize_detections
#summarize_detections(detections_filtered)
# By animal ====
sum_animal <- summarize_detections(detections_filtered, location_col = 'station', summ_type='animal')
sum_animal
We can also summarize by location, grouping our data by distinct locations.
# By location ====
sum_location <- summarize_detections(detections_filtered, location_col = 'station', summ_type='location')
head(sum_location)
summarize_detections will return different summaries depending on the summ_type parameter, which can take “animal”, “location”, or “both”. More information on what these summaries return and how they are structured can be found in the help files (?summarize_detections).
If you have another column that describes the location of a detection and would prefer to use it, you can specify it with the location_col parameter. In the example below, we will create a new column and use that as the location.
# You can make your own column and use that as the location_col
# For example, we will create a station_uniq column in case you have duplicate station names across projects
detections_filtered_special <- detections_filtered %>%
mutate(station_uniq = paste(glatos_array, station, sep=':'))
sum_location_special <- summarize_detections(detections_filtered_special, location_col = 'station_uniq', summ_type='location')
head(sum_location_special)
For the next example, we’ll summarize by both animal and location, as outlined above.
# By both dimensions
sum_animal_location <- summarize_detections(det = detections_filtered,
location_col = 'station',
summ_type='both')
head(sum_animal_location)
Summarizing by both dimensions will create a row for each animal and station pair. This can be a bit cluttered, so let’s use a filter to remove every row where the animal was not detected at the corresponding station.
# Filter out stations where the animal was NOT detected.
sum_animal_location <- sum_animal_location %>% filter(num_dets > 0)
sum_animal_location
We can also summarize by a subset of our animals. If we only want to see summary data for a fixed set of animals, we can pass a vector containing the animal_ids we want summarized.
# create a custom vector of Animal IDs to pass to the summary function
# look only for these ids when doing your summary
tagged_fish <- c('22', '23')
sum_animal_custom <- summarize_detections(det=detections_filtered,
animals=tagged_fish, # Supply the vector to the function
location_col = 'station',
summ_type='animal')
sum_animal_custom
Now that we have an overview of how to quickly and elegantly summarize our data, let’s make our dataset more amenable to plotting by reducing it from detections to detection events.
Detection Events differ from detections in that they condense a lot of temporally and spatially clustered detections for a single animal into a single detection event. This is a powerful and useful way to clean up the data, and makes it easier to present and clearer to read. Fortunately, this is easy to do with glatos.
# Reduce Detections to Detection Events ####
# ?glatos::detection_events
# you specify how long an animal must be absent before starting a fresh event
events <- detection_events(detections_filtered,
location_col = 'station',
time_sep=3600)
head(events)
location_col tells the function what to use as the locations by which to group the data, while time_sep tells it how much time has to elapse between sequential detections before the detection belongs to a new event (in this case, 3600 seconds, or an hour). The threshold for your data may be different depending on the purpose of your project.
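Because the right time_sep depends on your project, it can help to check how sensitive the event count is to that choice. A minimal sketch comparing the one-hour events above against a two-hour threshold; the 7200-second value is illustrative:
# merging detections separated by up to two hours produces fewer, longer events
events_2h <- detection_events(detections_filtered,
                              location_col = 'station',
                              time_sep = 7200)
nrow(events)    # built with time_sep = 3600
nrow(events_2h) # fewer rows, since nearby events get merged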
We can also keep the full extent of our detections, but add a group column so that we can see how they would have been condensed.
# keep detections, but add a 'group' column for each event group
detections_w_events <- detection_events(detections_filtered,
location_col = 'station',
time_sep=3600, condense=FALSE)
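As before, a quick head() check confirms the original detection rows are intact, now with the event-grouping information appended:
# every detection keeps its own row, plus a column identifying its event group
head(detections_w_events)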
Key Points
BONUS - More Features of glatos
Overview
Teaching: 15 min
Exercises: 0 min
Questions
What other features does glatos offer?
Objectives
glatos has more advanced analytic tools that let you manipulate your data further. We’ll cover a few of these features now, to show you how to take your data beyond just filtering and event creation. We’ll also show you how to move your data from glatos to VTrack, another powerful suite of data manipulation tools. By combining the glatos package’s built-in functions with its interoperability across scientific R packages, we’ll show you how to derive deeper insights from your data, and format it in a way that lets you demonstrate them.
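The VTrack hand-off mentioned above goes through glatos’ converter functions. A minimal sketch, assuming the package’s bundled sample_receivers.csv stands in for your own receiver metadata (check ?convert_glatos_to_att for the exact requirements):
# bundled receiver metadata ships with the package, like the walleye detections
rec_file <- system.file("extdata", "sample_receivers.csv", package = "glatos")
receivers <- read_glatos_receivers(rec_file)

# convert detections + receivers into the ATT format VTrack expects
att_data <- convert_glatos_to_att(detections_filtered, receivers)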
glatos can be used to get the residence index of your animals at all the different stations. In fact, glatos offers five different methods for calculating residence index. For this lesson, we will showcase two of them, but more information on the others can be found in the glatos documentation.
The residence_index() function requires an events object to create a residence index. We will start by creating a subset like we did in the last lesson. With a dataset of this size, it is not strictly necessary, but it is useful to know how to do. On larger datasets, the residence_index() function can take a prohibitively long time to run, and as such there are instances in which you will not want to use the full dataset. Another example of subsetting is therefore helpful.
First we will decide which animals to base our subset on. To help us with this, we can use group_by on the events object to make it easier to identify good candidates.
#Using all the events data will take too long, so we will subset to just use a couple animals
events %>% group_by(animal_id) %>% summarise(count=n()) %>% arrange(desc(count))
#In this case, we have already decided to use these two animal IDs as the basis for our subset.
subset_animals <- c('22', '153')
events_subset <- events %>% filter(animal_id %in% subset_animals)
events_subset
Now that we have a subset of our events object, we can apply the residence_index() function with different calculation methods.
# Calc residence index using the Kessel method
rik_data <- residence_index(events_subset,
calculation_method = 'kessel')
# "Kessel" method is a special case of "time_interval" where time_interval_size = "1 day"
rik_data
# Calc residence index using the time interval method, interval set to 6 hours
rit_data <- residence_index(events_subset,
calculation_method = 'time_interval',
time_interval_size = "6 hours")
rit_data
Although the code we’ve written for each method of calculating the residence index is similar, the different parameters and calculation methods mean that these will return different results. It is up to you to investigate which of the methods within glatos best suits your data and its intended application.
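One quick way to investigate is to line the two indices up side by side. A minimal sketch joining the Kessel and time-interval results; the column names here (residency_index, station, animal_id) are an assumption based on recent glatos versions, so check names(rik_data) first:
# pair each station/animal's Kessel index with its time-interval index
ri_compare <- rik_data %>%
  select(animal_id, station, kessel_ri = residency_index) %>%
  inner_join(rit_data %>%
               select(animal_id, station, interval_ri = residency_index),
             by = c("animal_id", "station"))
head(ri_compare)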
We will continue with glatos for one more lesson, in which we will cover some basic, but very versatile, visualization functions provided by the package.
Key Points
BONUS - Basic Visualization and Plotting with glatos
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How can I use glatos to plot my data?
What kinds of plots can I make with my data?
Objectives
Now that we’ve cleaned and processed our data, we can use glatos’ built-in plotting tools to make quick and effective visualizations. One of the simplest is an abacus plot, which displays animal detections against the appropriate stations. To this end, glatos supplies a built-in, customizable abacus_plot function.
# Visualizing Data - Abacus Plots ####
# ?glatos::abacus_plot
# customizable version of the standard VUE-derived abacus plots
abacus_plot(detections_w_events,
location_col='station',
main='Walleye detections by station') # can use plot() arguments here; they get passed through to plot()
This is good, but you can see that the plot is cluttered. Rather than plotting our entire dataset, let’s try filtering out a single animal ID and only plotting that. We can do this right in our call to abacus_plot with the filtering syntax we’ve previously covered.
# pick a single fish to plot
abacus_plot(detections_filtered[detections_filtered$animal_id=="22",],
location_col='station',
main="Animal 22 Detections By Station")
Other plots are available in glatos and can show different facets of our data. If we want to see the physical distribution of our stations, for example, a bubble plot will serve us better.
# Bubble Plots for Spatial Distribution of Fish ####
# bubble variable gets the summary data that was created to make the plot
?detection_bubble_plot
bubble_station <- detection_bubble_plot(detections_filtered,
location_col = 'station',
out_file = 'walleye_bubbles_by_stations.png')
bubble_station
bubble_array <- detection_bubble_plot(detections_filtered,
out_file = 'walleye_bubbles_by_array.png')
bubble_array
These examples provide just a brief introduction to some of the plotting available in glatos.
glatos Challenge
Challenge 1: Create a bubble plot of the stations in Lake Erie only. Set the bounding box using the provided nw and se coordinates and resize the points. As a bonus, add points for the other receivers in Lake Erie. Hint: ?detection_bubble_plot will help a lot. Here’s some code to get you started:
erie_arrays <- c("DRF", "DRL", "DRU", "MAU", "RAR", "SCL", "SCM", "TSR") #given
nw <- c(43, -83.75) #given
se <- c(41.25, -82) #given
Solution
erie_arrays <- c("DRF", "DRL", "DRU", "MAU", "RAR", "SCL", "SCM", "TSR") #given
nw <- c(43, -83.75) #given
se <- c(41.25, -82) #given

bubble_challenge <- detection_bubble_plot(detections_filtered,
                                          background_ylim = c(se[1], nw[1]), #ylim runs south to north (min, max)
                                          background_xlim = c(nw[2], se[2]),
                                          symbol_radius = 0.75,
                                          location_col = 'station',
                                          col_grad = c('white', 'green'),
                                          out_file = 'glatos_bubbles_challenge.png')
Key Points
Other OTN Telemetry Curriculums
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How can I expand my learning?
Objectives
OTN has hosted other workshops in the past which contain different code sets that may be useful to explore after this workshop.
- IdeasOTN Telemetry Workshop Series 2020: code available here and videos available on our YouTube here
Many of our Intro to R workshops are based upon this curriculum from The Carpentries.
Key Points