Tidying up a data set: A Case Study with TB data
Tidy Data
Structure of tidy data
rows (cases/observational units) and
columns (variables).
The key is that every row is a case and every column is a variable.
No exceptions.
But what should the rows be?
That depends on the purpose.
- For plotting, we often want “glyph ready data”
- one row per glyph that gets drawn
- columns contain variables that get mapped to attributes
- The answer may be different at different points in an analysis (or for different layers in the same plot)
You should always know what the rows represent.
Case Study: Tidying up WHO data
The Data
The who data set in the tidyr package contains data from the World Health Organization Global Tuberculosis Report for 1995-2013. You can find out more about this data at http://www.who.int/tb/country/data/download/en/
The population data set contains global populations from 1980 to 2013.
An Inconvenient Form: Data in variables
For many purposes, the WHO data are in an inconvenient format. Each variable new_sp_m014 through newrel_f65 counts the number of new TB cases of a certain kind recorded in a given country, in a given year. The documentation says the variable names are constructed as follows:
They all begin
newThe next bit is a code for the method of diagnosis
rel= relapse,sn= negative pulmonary smear,sp= positive pulmonary smear,ep= extrapulmonary- Note: The separating
_is missing forrelbut present for the others.
After another
_there is amor anffor male or femaleThis is followed directly by an age code
- 014 = 0-14 yrs of age, 1524 = 15-24 years of age, 2534 = 25 to 34 years of age, 3544 = 35 to 44 years of age, 4554 = 45 to 54 years of age, 5564 = 55 to 64 years of age, 65 = 65 years of age or older
This format is sometimes called data-in-variables format, since some of the information that we might think of as data values is stored in the names of the variables.
Discussion
For what purposes might this be tidy data?
Exercises
- How many countries are in each data set (
whoandpopulation)?
who |>
group_by(country) |>
summarise(n = n()) |>
mutate(id = n():1)
population |>
group_by(country) |>
summarise(n = n()) |>
mutate(id = n():1)- What countries are listed in one data set but not in the other?
There are a number of ways to do this. This approach gives us some additional useful information.
bind_cols(
who |>
group_by(country) |>
summarise(n = n()) |>
mutate(id = n():1),
population |>
group_by(country) |>
summarise(n = n()) |>
mutate(id = n():1)
) |>
filter(country != country1 | id != id1)
This solution uses
anti_join()to find the rows that cannot be matched by countrypull()to grab only the country variableunique()to list each country only once
who |> anti_join(population, by = "country") |> pull(country) |> unique()
population |> anti_join(who, by = "country") |> pull(country) |> unique()Reshaping the data.
Goal:
row = a person group in a given year and country
person group described by variables
country,iso2,iso3,year,sex,diagnosis,age_groupadditional variable:
count= number of cases
This format is good for many purposes (including subsequent reshaping and tidying).
Fixing inconsistent variable names
Let start by making the naming scheme a bit more uniform. We will replace newrel with new_rel.
Note that setNames() returns a new object, with the names potentially modified from the original object’s names. It does not modify the object in place. So if you want the new names to persist, remember to save the result of using setNames().
Note: We could use stringr::str_replace() instead of sub()
pivot_longer()
pivot_longer() is used to convert the names and values of multiple columns into multiple two-column rows. In each pair, one column will contain the original names (names_to) and the other will contain the original values (values_to).
Here is a demonstration using just the 2000 observations from the USA.
Notice how columns 5 - 60 (the ones that include “new” in their names) turned into two columns (code and new_cases). Columns 1 - 4, which were not included in our list of columns, were duplicated.
If we omit the filter() command, we can process the entire data set.
Note: The inverse operation to pivot_loger() is called pivot_wider(). It takes a pair of columns and turns one column into variable names (names_from) and the other into values (values_from).
separate()
Now we need to separate our code column into separate columns, one for each piece of information coded.
And subsequently we can separate the sex from the age group using position rather than a character to do the separation.
Note: The inverse operation of separate() is unite(). (paste() is often useful as well.)
Exercises
- Add on some additional code to check that things are working.
- Show that the only values of
sexaremandfand that they appear in equal numbers. - Do a similar thing for the diagnosis groups and age groups
# here is the solution for checking that we have equal numbers of males and females.
# you can
who |>
setNames(sub("newrel", "new_rel", names(who))) |>
pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
separate(sexage, c("sex", "age_group"), sep = 1) |>
group_by(sex) |>
summarise(n = n())- Construct a plot that shows the total number of new TB cases over time for each country.
- That’s a lot of countries. Perhaps you might like to remake the plot using only a subset of countries. Which countries?
# Note: This will be more useful once we can scale by total population
who |>
setNames(sub("newrel", "new_rel", names(who))) |>
pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
separate(sexage, c("sex", "age_group"), sep = 1) |>
group_by(country, iso3, year) |>
summarise(total_cases = sum(new_cases, na.rm = TRUE)) |>
filter(iso3 %in% c("USA", "IND", "CHN", "NGA", "DEU", "GHA")) |>
gf_line(total_cases ~ year, color = ~country)Joins
Comparing TB cases to population
More interesting than the number of TB cases is the fraction of the population that is contracting TB.
whodata does not include population.- the
populationdata has population information, but nothing about TB
We need to join these data sets together so we can compute the TB rate for each country (in each year).
Both data sets have a country variable that we can use to match things up between data sets.
We will consider our modified
whodata to be primary data and add in population information frompopulation.Unfortunately, two country names are spelled differently in
whoandpopulation, so we will need to address this before joining.
The join functions
dplyr has a number of functions with “join” in the name that are used to bring data from two sources together into one. The basic syntax is
LeftData |>
some_join(RightData, ...)where some_join() is one of
inner_join(): return all rows fromLeftDatawhere there are matching values inData2, and all columns fromLeftDataandRightData. If there are multiple matches betweenLeftDataandRightData, all combination of the matches are returned.left_join(): return all rows fromLeftData, and all columns fromLeftDataandRightData. Rows inLeftDatawith no match inRightDatawill have NA values in the new columns. If there are multiple matches betweenLeftDataandRightData, all combinations of the matches are returned.right_join(): return all rows fromRightData, and all columns fromLeftDataandRightData. Rows inRightDatawith no match inLeftDatawill have NA values in the new columns. If there are multiple matches betweenLeftDataandRightData, all combinations of the matches are returned.semi_join(): return all rows fromLeftDatawhere there are matching values inRightData, keeping just columns fromLeftData. A semi join differs from an inner join because an inner join will return one row ofLeftDatafor each matching row ofRightData, where a semi join will never duplicate rows ofLeftData.anti_join(): return all rows fromLeftDatawhere there are not matching values inRightData, keeping just columns fromLeftData.
full_join(): return all rows and all columns from both LeftData and RightData. Where there are not matching values, returns NA for the one missing.
Using anti_join() to detect joining problems
We can recode the countries in population to match the names in who using tidyr::case_when().
Let’s save this with a new name so we can reuse it below.
We want a left_join() of our modified who data with the population data. By default, matching is on all columns that have the same names. That works just right for us in this case. When the columns for matching have different names, there are ways to specify which names in LeftData correspond with which names in RightData (or we can rename the variables to make them match).
A message is displayed to show which variables were used for matching. It’s good to check that those are what you expect.
Exercise
Use the joined data below to make a plot showing the distribution of TB rates.
who |>
setNames(sub("newrel", "new_rel", names(who))) |>
pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
separate(sexage, c("sex", "age_group"), sep = 1) |>
left_join(population2) |>
group_by(country, year) |>
summarise(
total_cases = sum(new_cases, na.rm = TRUE),
tb_rate = total_cases / max(population)
) |>
filter(year == 2010) |>
gf_dens(~ tb_rate)Exercise
Use the joined data below to locate the ten countries with the highest average TB rate over the years for which they reported data.
Plot the TB rates over time for these ten countries.
who |>
setNames(sub("newrel", "new_rel", names(who))) |>
pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
separate(sexage, c("sex", "age_group"), sep = 1) |>
left_join(population2) |>
group_by(country, iso3, year) |>
summarise(
total_cases = sum(new_cases, na.rm = TRUE),
tb_rate = total_cases / max(population)
) |>
group_by(country, iso3) |>
mutate(mean_tb_rate = mean(tb_rate, na.rm = TRUE)) |>
ungroup() |>
arrange(-mean_tb_rate) |>
head(10 * (2013 - 1979)) |>
gf_line(tb_rate ~ year, color = ~ iso3) |>
gf_lims(x = c(1990, NA))A few other things
We didn’t need them in this case study, but here are some additional functions that can be useful for tidying up your data.
Fancy table output
There are a number of packages that provide fancier output of data tables. Some only work in certain document formats. Some are highly customizable.
Here is an example of a customized table output with {gt}. We have changed the number formatting for population figures and added a number of interactive features. See the documentation for details and additional features.
Extracting values from text
readr::parse_number(), readr::parse_integer(), readr::parse_date(), etc.: These functions read through the values in a variable and do there best to extract the type of information requested. The R ingestion functions will typically turn what you expect to be numeric values into text if there are any values that it cannot interpret as a number. So this can be very handy when bringing in data created in other software.
Exercises
Enter some other types of text to see how
parse_number()converts things.Try changing
parse_number()to one of the other parsers.
Add
na = c("", "-")to theparse_number()call above. What does that do?Dates are tricky because there are so many formats, and a date can be ambiguous without knowledge of the format used. (What date is 10-11-12?)
lubridate()provides functions likemdy(),ymd(), etc. that allow us to specify the order of year, month, and day. (Addhmsif you also have times to process.) Editsome_datesbelow to see what formats are correctly parsed.
The lubridate package has many other utilities for working with dates and times.