Data Wrangling in the Tidyverse

Author

Randall Pruim

Published

October 26, 2024

About this Tutorial

Toolkit

This tutorial was written using WebR.

A little help from my friends

This tutorial has evolved from presentations and tutorials made by

Randall Pruim
Danny Kaplan
Jo Hardin
Miles Ott
Nicholas Horton

Data used in the tutorial

For simplicity, all the data in this tutorial come from R packages. We will rename some data sets so that in this tutorial data sets are capitalized and variables (mostly) are not.

Note

This chunk is autorun when this tutorial is loaded to make sure you have the data available to you throughout the tutorial. Most other chunks are not run until you hit the play button.

What is the tidyverse?

A collection of R packages

written by Hadley Wickham (and friends)
that play well together
to create and manage
tidy data.

data I have \(\stackrel{\mbox{tidyverse}}{\longrightarrow}\) data I want

Tidy data

Rows are cases (of interest)
Columns are variables

Tidy vs Clean

As grandma says: tidy and clean are two different things

tidy data may not be clean, but
tidy data are easier to clean,
and it is easier to tell whether data are clean when they are tidy

Glyph Ready Data

Many plotting libraries (including ggformula and ggplot2) require or work best with “glyph ready” data:

information for each glyph (a mark on the “page”) is in a single row.
attributes of each glyph are determined by values of variables in that row.

tidyverse tools are useful for creating glyph ready data.

Chaining with the native pipe (|>)

The following are essentially equivalent

f(x, y, ...)

x |> f(y, ...)

Your Turn

Add a second line that is equivalent to the first but uses |>. (Note: Unless you use set.seed() you will get different results each time because of different random choices.)

Little Bunny Foo Foo

Here’s a silly example that illustrates why |> is helpful.

Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head.

Suppose we wanted to express this poem in code…

Without chaining

bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head)

Or perhaps

bop(
  scoop(
    hop(
      foo_foo, through = forest), 
    up = field_mice), 
  on = head)

With chaining

Rewrite the code above using |>.

foo_foo |>
  hop(through = forest) |>
  scoop(up = field_mice) |>
  bop(on = head)

Definitions of functions and variables that make this work

foo_foo <- "Little Bunny Foo Foo"
forest <- "the forest"
field_mice <- "the field mice"
head <- "the head"
hop <- function(object, through) {
  paste0(object, '\nWent hopping through ', through) 
}
scoop <- function(object, up) {
  paste0(object, "\nScooping up", up) 
}
bop <- function(object, on) {
  paste0(object, "\nAnd bopping them on ", on) |> cat()
}

Advantages of Chaining

reduces parenthesis chasing
keeps arguments near functions
mirrors order of operations (making code easier to read)
facilitates code editing (easy to eliminate or insert a step)

Data Verbs

Chaining in the tidyverse

Many functions in tidyverse

take tidy data as a first argument
return tidy data as a result
do not have side effects

This sets them up for chaining:

new_data <-
  old_data |>
  do_this( ... ) |>
  and_that( ... ) |>
  and_the_other( ... )

We will refer to these kinds of functions as (transitive) data verbs because they do something to (tidy) data without side effects.

Learning tidyverse Tools

There are basically two steps:

1. Learn how each data verb works in isolation.	(Low volume)
2. Break down complex tasks into a sequence steps	(Creativity)

Movies Data

The Movies data set contains information about movies.

Your Turn

Change glimpse() to skim() in the previous box; skimr::skim() provides an alternative way to take a quick look at a data set.
Now change to slice_head(). What does this function do?
Now change to nrow(). What does this function do?
Now change to names(). What does this function do?
Want more information? Try ?movies. (Note lower case here.)

Quiz

Is skim() a data verb?
Is glimpse() a data verb?
Is slice_head() a data verb?
Is nrow() a data verb?
Is names() a data verb?

6 Main Data Verbs

We’ll go through each of these in more detail in just a moment, but here are the key data verbs.

Reduction

summarise(): compute summary statistics

result is a 1-row data frame. (This is a white lie, but useful for now.)

Operations on Rows

filter(): keep rows that match logical condition (discard the others)
arrange(): reorder the rows (i.e., sort)

Operations on Columns

select(): keep some variables (columns) (discard the others)
mutate(): create new variables and add them to the data

Grouping

group_by(): data unchanged, but groups are noted.
- subsequent verbs are processed groupwise
- ungroup(): clear grouping notes

But wait, there’s more!

Here are some other data verbs

transmute() like mutate(), but returns only the new variables
rename(): rename some of the variables
distinct(): returns each unique row once
slice_sample(): returns randomly selected row(s)
slice_head(): returns the first few rows [ignores groups]
slice_tail(): returns the last few rows [ignores groups]
do(): used to create custom operations

One-row summaries with summarise()

Note: summarize() works for non New Zelanders, but it conflicts with a function with that same name in another package.

Note: n() only works in special tidyverse functions (like summarise()) and computes the same value as nrow() but doesn’t require us to name the data set inside summarise(). (This will be more important shortly.)

Multi-tasking allowed

Your turn

Add one more variable to the example above: the total length of all the movies in years.

# mean and total number of minutes of all the movies
Movies |> 
  summarise(
    n = n(),
    mean_length  = mean(length, na.rm = TRUE), 
    total_length =  sum(length, na.rm = TRUE),
    total_length_years =  sum(length, na.rm = TRUE) / (60 * 24 * 365),
    total_length_years2 =  total_length / (60 * 24 * 365)
    )

Useful functions for use with summarise()

min(), max(), mean(), sum(), sd(), median(), IQR(), …
n(): number of observations in the current group
n_distinct(): number of unique values of a variable
first_value(), last_value(), nth_value(x, n)

select()

Let’s make a data set that has fewer variables.

Reminder: select() is for columns/variables
Can use : to specify ranges of variables: title : budget
Can use - to deselect variables: -length

Your Turn

Select some variables of your own choosing.

Creating subsets – filter() and friends

Often it is useful to work with a subset of the data

inspect a few rows to understand the data
small data sets for prototyping analysis methods
analyses applied only to an “interesting” subset of the data
training and test data sets

Beginning and End

Random sample

filter()

Let’s use only use movies (cases) that have budget information and have shorter titles.

Creating new variables – mutate() and friends

mutate() can be used to add new variables (columns) that are calculated from existing variables. For example, let’s compute the dollars per minute (of runtime) for our films.

Your Turn

Change mutate() to transmute(). How do they differ?

group_by()

group_by() is what makes this system really hum. When tidy data is marked with groups, some subsequent data verbs are applied within groups.

Hint/Example

Here are some examples. But you should create your own that are different.

# mean length of movies in hours for all the movies, broken down by title length
Movies2 |> 
  mutate(
    hours = length / 60,
    dpm = budget / length) |> 
  group_by(title_length = nchar(title)) |> 
  summarise(
    n = n(), 
    mean_length = mean(hours, na.rm = TRUE),
    min_length  =  max(hours, na.rm = TRUE),
    max_length  =  min(hours, na.rm = TRUE)
    ) |>
  gf_pointrange(mean_length + min_length + max_length ~ title_length)

Your Turn

Add some additional summaries of your choosing.
Group by length of title instead of by rating.
Bonus: show the results with a plot rather than a table.

Warning

Groups are in some sense “invisible”, and it is easy to forget about them. When saving data that have been created with a grouping, you may want to use ungroup() to remove the grouping information and let future you start fresh – you have been warned…

group() and mutate()

group() impacts some other operations as well. What happens when mutate() is applied after group_by()? Give it a try and see.

arrange() reorders the cases

Example

Note about groups

By default, arrange() ignores groups. If you use .by_group = TRUE, then sorting will be done by grouping variables first, then by additional variables specified.

Group, summarise, arrange

Group-summarise-arrange is a common operation sequence to list groups in an order based on some calculation.

Your Turn

Fill in the missing pieces here to list the ratings in order of average length of film. Feel free to include additional summary quantities as well if you like.

What else is there?

Lots more. Here are few examples.

pivot_wider() and pivot_longer() for converting from “long” to “wide” or from “wide” to “long” shapes
separate() splitting a single variable into multiple variables.
bind_rows() and bind_cols() to (intelligently) glue data frames top-to-bottom or left-to-right.
left_join(), right_join(), inner_join(), etc. for merging data from multiple sources. (Like joins in SQL.)
The lubridate packages has functions for working with dates and times

TB Case Study tutorial illustrates some of these in a case study using tuberculosis data from the World Health Organization.

bind_rows()

The easiest way to combine data from multiple sources is when each source represents “more of the same thing”. Each data set might represent one site, one date (range), one researcher, etc.

bind_rows() will stack data sets, using column names to align and filling in with NA if some columns are missing in some data sets.

Database connections

These same commands can be used to manage data in an SQL data base.

converts data verbs into SQL query
allows you to inspect the query
only fetches a few rows unless you ask for more (to save time while you are debugging)

# establish connection -- details vary by type of database
con <- DBI::dbConnect(RMySQL::MySQL(), 
  host = "database.rstudio.com",
  user = "hadley",
  password = rstudioapi::askForPassword("Database password")
)

# extract a table
flights_db <- tbl(con, "flights")

# treat it just like in-memory data (mostly)
flights_db |>
  group_by(...) |>
  mutate(...) |>
  summarise(...) |>
  collect()          # get me all of the data

Notes

Some R commands can’t be translated into SQL and so don’t work inside mutate(), summarise(), etc.
Recent R Studio releases include expanded support for working with databases.

Your Turn – More Movies

A Smaller Data Set

When starting, it can be helpful to work with a small subset of the data. When you have your data wrangling statements in working order, shift to the entire data table.

Let’s get just recent movies that have an MPAA rating category.

You can use the smaller data set as you are figuring out the solutions to these exercises. Once have it sussed, you can switch to the full Movies data.

Your Turn

What is the average IMDB user rating?

What is the average IMDB user rating of movies for each mpaa category?

How many Action Movies in each year?

How many Comedies of each mpaa rating in each year?

Track the average IMDB ratings for movies with mpaa “R” over the years.

Track the number of IMDB ratings given to R-rated movies over the years

You could use total number or average number. Your choice.

(Hint: Use votes to see how many votes were cast for each movie.)

Track the average IMDB ratings for movies in each mpaa rating group over the years.

Challenge

Redo the previous problem but average over voters rather than movies (because some movies have many more votes than others).

You are on your way

Keep learning

TB Case Study demonstrates a data tidying project based on WHO data on TB cases.
RStudio webinar
RStudio Cheatsheets

Keep practicing

Use Babynames or Flights to hone your skills.

Keep a stiff upper lip