Data Wrangling in the Tidyverse
About this Tutorial
Toolkit
This tutorial was written using WebR.
A little help from my friends
This tutorial has evolved from presentations and tutorials made by
- Randall Pruim
- Danny Kaplan
- Jo Hardin
- Miles Ott
- Nicholas Horton
Data used in the tutorial
For simplicity, all the data in this tutorial come from R packages. We will rename some data sets so that in this tutorial data sets are capitalized and variables (mostly) are not.
This chunk is autorun when this tutorial is loaded to make sure you have the data available to you throughout the tutorial. Most other chunks are not run until you hit the play button.
What is the tidyverse?
A collection of R packages
written by Hadley Wickham (and friends)
that play well together
to create and manage
tidy data.
data I have \(\stackrel{\mbox{tidyverse}}{\longrightarrow}\) data I want
Tidy data
Rows are cases (of interest)
Columns are variables
Tidy vs Clean
As grandma says: tidy and clean are two different things
- tidy data may not be clean, but
- tidy data are easier to clean,
- and it is easier to tell whether data are clean when they are tidy
Glyph Ready Data
Many plotting libraries (including ggformula
and ggplot2
) require or work best with “glyph ready” data:
- information for each glyph (a mark on the “page”) is in a single row.
- attributes of each glyph are determined by values of variables in that row.
tidyverse
tools are useful for creating glyph ready data.
Chaining with the native pipe (|>)
The following are essentially equivalent
f(x, y, ...)
|> f(y, ...) x
Your Turn
Add a second line that is equivalent to the first but uses |>
. (Note: Unless you use set.seed()
you will get different results each time because of different random choices.)
Little Bunny Foo Foo
Here’s a silly example that illustrates why |>
is helpful.
Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head.
Suppose we wanted to express this poem in code…
Without chaining
bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head)
Or perhaps
bop(
scoop(
hop(
through = forest),
foo_foo, up = field_mice),
on = head)
With chaining
Rewrite the code above using |>
.
|>
foo_foo hop(through = forest) |>
scoop(up = field_mice) |>
bop(on = head)
<- "Little Bunny Foo Foo"
foo_foo <- "the forest"
forest <- "the field mice"
field_mice <- "the head"
head <- function(object, through) {
hop paste0(object, '\nWent hopping through ', through)
}<- function(object, up) {
scoop paste0(object, "\nScooping up", up)
}<- function(object, on) {
bop paste0(object, "\nAnd bopping them on ", on) |> cat()
}
Advantages of Chaining
- reduces parenthesis chasing
- keeps arguments near functions
- mirrors order of operations (making code easier to read)
- facilitates code editing (easy to eliminate or insert a step)
Data Verbs
Chaining in the tidyverse
Many functions in tidyverse
take tidy data as a first argument
return tidy data as a result
do not have side effects
This sets them up for chaining:
<-
new_data |>
old_data do_this( ... ) |>
and_that( ... ) |>
and_the_other( ... )
We will refer to these kinds of functions as (transitive) data verbs because they do something to (tidy) data without side effects.
Learning tidyverse Tools
There are basically two steps:
1. Learn how each data verb works in isolation. | (Low volume) |
2. Break down complex tasks into a sequence steps | (Creativity) |
Movies Data
The Movies
data set contains information about movies.
Your Turn
Change
glimpse()
toskim()
in the previous box;skimr::skim()
provides an alternative way to take a quick look at a data set.Now change to
slice_head()
. What does this function do?Now change to
nrow()
. What does this function do?Now change to
names()
. What does this function do?Want more information? Try
?movies
. (Note lower case here.)
Quiz
Is
skim()
a data verb?Is
glimpse()
a data verb?Is
slice_head()
a data verb?Is
nrow()
a data verb?Is
names()
a data verb?
6 Main Data Verbs
We’ll go through each of these in more detail in just a moment, but here are the key data verbs.
Reduction
summarise()
: compute summary statistics
- result is a 1-row data frame. (This is a white lie, but useful for now.)
Operations on Rows
filter()
: keep rows that match logical condition (discard the others)arrange()
: reorder the rows (i.e., sort)
Operations on Columns
select()
: keep some variables (columns) (discard the others)mutate()
: create new variables and add them to the data
Grouping
group_by()
: data unchanged, but groups are noted.- subsequent verbs are processed groupwise
ungroup()
: clear grouping notes
But wait, there’s more!
Here are some other data verbs
transmute()
likemutate()
, but returns only the new variablesrename()
: rename some of the variablesdistinct()
: returns each unique row onceslice_sample()
: returns randomly selected row(s)slice_head()
: returns the first few rows [ignores groups]slice_tail()
: returns the last few rows [ignores groups]do()
: used to create custom operations
One-row summaries with summarise()
Note: summarize()
works for non New Zelanders, but it conflicts with a function with that same name in another package.
Note: n()
only works in special tidyverse
functions (like summarise()
) and computes the same value as nrow()
but doesn’t require us to name the data set inside summarise()
. (This will be more important shortly.)
Multi-tasking allowed
Your turn
Add one more variable to the example above: the total length of all the movies in years.
# mean and total number of minutes of all the movies
|>
Movies summarise(
n = n(),
mean_length = mean(length, na.rm = TRUE),
total_length = sum(length, na.rm = TRUE),
total_length_years = sum(length, na.rm = TRUE) / (60 * 24 * 365),
total_length_years2 = total_length / (60 * 24 * 365)
)
Useful functions for use with summarise()
min()
,max()
,mean()
,sum()
,sd()
,median()
,IQR()
, …n()
: number of observations in the current groupn_distinct()
: number of unique values of a variablefirst_value()
,last_value()
,nth_value(x, n)
select()
Let’s make a data set that has fewer variables.
- Reminder:
select()
is for columns/variables - Can use
:
to specify ranges of variables:title : budget
- Can use
-
to deselect variables:-length
Your Turn
Select some variables of your own choosing.
Creating subsets – filter() and friends
Often it is useful to work with a subset of the data
- inspect a few rows to understand the data
- small data sets for prototyping analysis methods
- analyses applied only to an “interesting” subset of the data
- training and test data sets
Beginning and End
Random sample
filter()
Let’s use only use movies (cases) that have budget information and have shorter titles.
Creating new variables – mutate() and friends
mutate()
can be used to add new variables (columns) that are calculated from existing variables. For example, let’s compute the dollars per minute (of runtime) for our films.
Your Turn
Change mutate()
to transmute()
. How do they differ?
group_by()
group_by()
is what makes this system really hum. When tidy data is marked with groups, some subsequent data verbs are applied within groups.
Hint/Example
Here are some examples. But you should create your own that are different.
# mean length of movies in hours for all the movies, broken down by title length
|>
Movies2 mutate(
hours = length / 60,
dpm = budget / length) |>
group_by(title_length = nchar(title)) |>
summarise(
n = n(),
mean_length = mean(hours, na.rm = TRUE),
min_length = max(hours, na.rm = TRUE),
max_length = min(hours, na.rm = TRUE)
|>
) gf_pointrange(mean_length + min_length + max_length ~ title_length)
Your Turn
- Add some additional summaries of your choosing.
- Group by length of title instead of by rating.
- Bonus: show the results with a plot rather than a table.
Groups are in some sense “invisible”, and it is easy to forget about them. When saving data that have been created with a grouping, you may want to use ungroup()
to remove the grouping information and let future you start fresh – you have been warned…
group() and mutate()
group()
impacts some other operations as well. What happens when mutate()
is applied after group_by()
? Give it a try and see.
arrange() reorders the cases
Example
By default, arrange()
ignores groups. If you use .by_group = TRUE
, then sorting will be done by grouping variables first, then by additional variables specified.
Group, summarise, arrange
Group-summarise-arrange is a common operation sequence to list groups in an order based on some calculation.
Your Turn
Fill in the missing pieces here to list the ratings in order of average length of film. Feel free to include additional summary quantities as well if you like.
|>
Movies2 group_by(mpaa) |>
summarise(mean_length = mean(length)) |>
arrange(mean_length)
What else is there?
Lots more. Here are few examples.
pivot_wider()
andpivot_longer()
for converting from “long” to “wide” or from “wide” to “long” shapesseparate()
splitting a single variable into multiple variables.bind_rows()
andbind_cols()
to (intelligently) glue data frames top-to-bottom or left-to-right.left_join()
,right_join()
,inner_join()
, etc. for merging data from multiple sources. (Like joins in SQL.)The
lubridate
packages has functions for working with dates and times
TB Case Study tutorial illustrates some of these in a case study using tuberculosis data from the World Health Organization.
bind_rows()
The easiest way to combine data from multiple sources is when each source represents “more of the same thing”. Each data set might represent one site, one date (range), one researcher, etc.
bind_rows()
will stack data sets, using column names to align and filling in with NA
if some columns are missing in some data sets.
Database connections
These same commands can be used to manage data in an SQL data base.
- converts data verbs into SQL query
- allows you to inspect the query
- only fetches a few rows unless you ask for more (to save time while you are debugging)
# establish connection -- details vary by type of database
<- DBI::dbConnect(RMySQL::MySQL(),
con host = "database.rstudio.com",
user = "hadley",
password = rstudioapi::askForPassword("Database password")
)
# extract a table
<- tbl(con, "flights")
flights_db
# treat it just like in-memory data (mostly)
|>
flights_db group_by(...) |>
mutate(...) |>
summarise(...) |>
collect() # get me all of the data
Some R commands can’t be translated into SQL and so don’t work inside
mutate()
,summarise()
, etc.Recent R Studio releases include expanded support for working with databases.
Your Turn – More Movies
A Smaller Data Set
When starting, it can be helpful to work with a small subset of the data. When you have your data wrangling statements in working order, shift to the entire data table.
Let’s get just recent movies that have an MPAA rating category.
You can use the smaller data set as you are figuring out the solutions to these exercises. Once have it sussed, you can switch to the full Movies
data.
Your Turn
What is the average IMDB user rating?
|>
RecentMovies summarise(avg = mean(rating))
What is the average IMDB user rating of movies for each mpaa category?
How many Action Movies in each year?
Hint: Action
contains 0 or 1 depending on whether the movie is an action movie.
How many Comedies of each mpaa rating in each year?
Hint: Comedy
is 1 for comedies, else 0.
Track the average IMDB ratings for movies with mpaa “R” over the years.
Track the number of IMDB ratings given to R-rated movies over the years
You could use total number or average number. Your choice.
(Hint: Use votes
to see how many votes were cast for each movie.)
Track the average IMDB ratings for movies in each mpaa rating group over the years.
Challenge
Redo the previous problem but average over voters rather than movies (because some movies have many more votes than others).
You are on your way
Keep learning
TB Case Study demonstrates a data tidying project based on WHO data on TB cases.
Keep practicing
Use Babynames
or Flights
to hone your skills.