Base R functions vs. Tidyverse

Overview

Teaching: 15 min
Exercises: 0 min

Questions

How can I introspect, subset, and plot my data using base R?

How do I reformat dates as ‘strings’ into date objects?

What are the Tidyverse functions that will do the same tasks?

Objectives

Learn how to perform basic R data manipulation tasks in base R, and using the Tidyverse

The Tidyverse and its component packages are reimagining how you work with tabular data in R. It includes component libraries and functions for organizing code for readability, cleaning data, and plotting it. While tidyverse is only one of a few competing paradigms for performing these jobs in R, here we will show a few examples of performing tasks in base R and again using Tidy methods.

First, some reminders on how to quickly get help with R functions.

Getting help with Functions

# Getting Help with Functions/Packages ####
?barplot
args(lm)
?lm

The Tidyverse

Intro - looking at data

OK, let’s load the Tidyverse libraries, and then load and explore a working dataset that is included in the example repository.

# Tidyverse - provides dplyr, ggplot2, and many other packages that simplify data.frame manipulation in R
library(tidyverse)

                      # Data and code helpfully provided by Dr. Robert Lennox of NORCE
load("seaTrout.rda")  # load 1.5m sea trout detections from a dataset in the Norwegian fjords.

# Lots of ways to get a clue about what this new data var is and how it looks. The general format of our exploration will be:

# Base R way:
str(seaTrout)

# Tidyverse option:
glimpse(seaTrout)

You can also call a function to just look at the first few rows, head()

# Base R way:
head(seaTrout)

# alternate way of calling the same head() function using Tidyverse's magrittr:
seaTrout %>% head

Challenge: what are some useful arguments you can pass to the head function?

OK, let’s do some more subsetting of our data. What if we want one column? How would we return the first 5 rows?

# Select one column by number, base and then Tidy
seaTrout[,c(6)]

seaTrout %>% select(6)


# Select the first 5 rows.
seaTrout[c(1:5),]

seaTrout %>% slice(1:5)

Question: what’s this c() business all about?

Summarizing data - finding unique values:

OK, something a little more practical, how would we determine what values are in our Species column? First base, then tidy.

# basic functions 1.2: how many species do we have?

nrow(data.frame(unique(seaTrout$Species)))

seaTrout %>% distinct(Species) %>% nrow

Dealing with Dates

While a whole module could be written on this topic, here we’ll show just the difference between formatting dates using base R and using lubridate, a wonderful library for managing date formats. This is also the first place we’ll start to see some minor speed differences in using Tidy vs. base functions.

# basic function 1.3: format date times

as.POSIXct(seaTrout$DateTime) # Check if your beverage needs refilling

seaTrout %>% mutate(DateTime=ymd_hms(DateTime))  # Lightning-fast, thanks to Lubridate's ymd_hms(), 'yammed hams'!
                                                 # For datestrings in American, try dmy_hms()

Filtering data

We’ll often want to subset data to evaluate what’s happening within a subgroup. Here, let’s look at selecting out all the Trout in our data vs. the salmon. Some small speed difference here if you can detect it.

# basic function 1.4: filtering

seaTrout[which(seaTrout$Species=="Trout"),]

seaTrout %>% filter(Species=="Trout")

Plotting

We’ll do some more complex stuff with ggplot in a minute, but if you have two axes and want to see them on a plot, here are the base R and ggplot ways of getting there quickly.

# basic function 1.5: plotting

plot(seaTrout$lon, seaTrout$lat)  # check again if beverage is full

seaTrout %>% ggplot(aes(lon, lat))+geom_point() # sometimes plots just take time

Summarizing Data - applying functions to get summaries

We’ll do some more complex stuff with ggplot in a minute, but if you have two axes and want to see them on a plot, here are the base R and ggplot ways of getting there quickly.

# basic function 1.6: getting data summaries

tapply(seaTrout$lon, seaTrout$tag.ID, mean) # get the mean value across a column

seaTrout %>%
  group_by(tag.ID) %>%
  summarise(mean=mean(lon))   # Can use pipes and newlines to organize this process

Questions:

What arguments does tapply() take?
What method of summarizing do you find more readable?

Key Points

Many of the tasks you will need to do to analyze your data are made easier or more readable via Tidyverse methods.

previous episode

DFO & OTN Acoustic Telemetry Workshop

next episode