Tidying up a data set: A Case Study with TB data
Tidy Data
Structure of tidy data
rows (cases/observational units) and
columns (variables).
The key is that every row is a case and every column is a variable.
No exceptions.
But what should the rows be?
That depends on the purpose.
- For plotting, we often want “glyph ready data”
- one row per glyph that gets drawn
- columns contain variables that get mapped to attributes
- The answer may be different at different points in an analysis (or for different layers in the same plot)
You should always know what the rows represent.
Case Study: Tidying up WHO data
The Data
The who
data set in the tidyr
package contains data from the World Health Organization Global Tuberculosis Report for 1995-2013. You can find out more about this data at http://www.who.int/tb/country/data/download/en/
The population
data set contains global populations from 1980 to 2013.
An Inconvenient Form: Data in variables
For many purposes, the WHO data are in an inconvenient format. Each variable new_sp_m014
through newrel_f65
counts the number of new TB cases of a certain kind recorded in a given country, in a given year. The documentation says the variable names are constructed as follows:
They all begin
new
The next bit is a code for the method of diagnosis
rel
= relapse,sn
= negative pulmonary smear,sp
= positive pulmonary smear,ep
= extrapulmonary- Note: The separating
_
is missing forrel
but present for the others.
After another
_
there is am
or anf
for male or femaleThis is followed directly by an age code
- 014 = 0-14 yrs of age, 1524 = 15-24 years of age, 2534 = 25 to 34 years of age, 3544 = 35 to 44 years of age, 4554 = 45 to 54 years of age, 5564 = 55 to 64 years of age, 65 = 65 years of age or older
This format is sometimes called data-in-variables format, since some of the information that we might think of as data values is stored in the names of the variables.
Discussion
For what purposes might this be tidy data?
Exercises
- How many countries are in each data set (
who
andpopulation
)?
|>
who group_by(country) |>
summarise(n = n()) |>
mutate(id = n():1)
|>
population group_by(country) |>
summarise(n = n()) |>
mutate(id = n():1)
- What countries are listed in one data set but not in the other?
There are a number of ways to do this. This approach gives us some additional useful information.
bind_cols(
|>
who group_by(country) |>
summarise(n = n()) |>
mutate(id = n():1),
|>
population group_by(country) |>
summarise(n = n()) |>
mutate(id = n():1)
|>
) filter(country != country1 | id != id1)
This solution uses
anti_join()
to find the rows that cannot be matched by countrypull()
to grab only the country variableunique()
to list each country only once
|> anti_join(population, by = "country") |> pull(country) |> unique()
who |> anti_join(who, by = "country") |> pull(country) |> unique() population
Reshaping the data.
Goal:
row = a person group in a given year and country
person group described by variables
country
,iso2
,iso3
,year
,sex
,diagnosis
,age_group
additional variable:
count
= number of cases
This format is good for many purposes (including subsequent reshaping and tidying).
Fixing inconsistent variable names
Let start by making the naming scheme a bit more uniform. We will replace newrel
with new_rel
.
Note that setNames()
returns a new object, with the names potentially modified from the original object’s names. It does not modify the object in place. So if you want the new names to persist, remember to save the result of using setNames()
.
Note: We could use stringr::str_replace()
instead of sub()
pivot_longer()
pivot_longer()
is used to convert the names and values of multiple columns into multiple two-column rows. In each pair, one column will contain the original names (names_to
) and the other will contain the original values (values_to
).
Here is a demonstration using just the 2000 observations from the USA.
Notice how columns 5 - 60 (the ones that include “new” in their names) turned into two columns (code
and new_cases
). Columns 1 - 4, which were not included in our list of columns, were duplicated.
If we omit the filter()
command, we can process the entire data set.
Note: The inverse operation to pivot_loger()
is called pivot_wider()
. It takes a pair of columns and turns one column into variable names (names_from
) and the other into values (values_from
).
separate()
Now we need to separate our code
column into separate columns, one for each piece of information coded.
And subsequently we can separate the sex from the age group using position rather than a character to do the separation.
Note: The inverse operation of separate()
is unite()
. (paste()
is often useful as well.)
Exercises
- Add on some additional code to check that things are working.
- Show that the only values of
sex
arem
andf
and that they appear in equal numbers. - Do a similar thing for the diagnosis groups and age groups
# here is the solution for checking that we have equal numbers of males and females.
# you can
|>
who setNames(sub("newrel", "new_rel", names(who))) |>
pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
separate(sexage, c("sex", "age_group"), sep = 1) |>
group_by(sex) |>
summarise(n = n())
- Construct a plot that shows the total number of new TB cases over time for each country.
- That’s a lot of countries. Perhaps you might like to remake the plot using only a subset of countries. Which countries?
# Note: This will be more useful once we can scale by total population
|>
who setNames(sub("newrel", "new_rel", names(who))) |>
pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
separate(sexage, c("sex", "age_group"), sep = 1) |>
group_by(country, iso3, year) |>
summarise(total_cases = sum(new_cases, na.rm = TRUE)) |>
filter(iso3 %in% c("USA", "IND", "CHN", "NGA", "DEU", "GHA")) |>
gf_line(total_cases ~ year, color = ~country)
Joins
Comparing TB cases to population
More interesting than the number of TB cases is the fraction of the population that is contracting TB.
who
data does not include population.- the
population
data has population information, but nothing about TB
We need to join these data sets together so we can compute the TB rate for each country (in each year).
Both data sets have a country
variable that we can use to match things up between data sets.
We will consider our modified
who
data to be primary data and add in population information frompopulation
.Unfortunately, two country names are spelled differently in
who
andpopulation
, so we will need to address this before joining.
The join functions
dplyr
has a number of functions with “join” in the name that are used to bring data from two sources together into one. The basic syntax is
|>
LeftData some_join(RightData, ...)
where some_join()
is one of
inner_join()
: return all rows fromLeftData
where there are matching values inData2
, and all columns fromLeftData
andRightData
. If there are multiple matches betweenLeftData
andRightData
, all combination of the matches are returned.left_join()
: return all rows fromLeftData
, and all columns fromLeftData
andRightData
. Rows inLeftData
with no match inRightData
will have NA values in the new columns. If there are multiple matches betweenLeftData
andRightData
, all combinations of the matches are returned.right_join()
: return all rows fromRightData
, and all columns fromLeftData
andRightData
. Rows inRightData
with no match inLeftData
will have NA values in the new columns. If there are multiple matches betweenLeftData
andRightData
, all combinations of the matches are returned.semi_join()
: return all rows fromLeftData
where there are matching values inRightData
, keeping just columns fromLeftData
. A semi join differs from an inner join because an inner join will return one row ofLeftData
for each matching row ofRightData
, where a semi join will never duplicate rows ofLeftData
.anti_join()
: return all rows fromLeftData
where there are not matching values inRightData
, keeping just columns fromLeftData
.
full_join()
: return all rows and all columns from both LeftData
and RightData
. Where there are not matching values, returns NA for the one missing.
Using anti_join() to detect joining problems
We can recode the countries in population
to match the names in who
using tidyr::case_when()
.
Let’s save this with a new name so we can reuse it below.
We want a left_join()
of our modified who
data with the population
data. By default, matching is on all columns that have the same names. That works just right for us in this case. When the columns for matching have different names, there are ways to specify which names in LeftData
correspond with which names in RightData
(or we can rename the variables to make them match).
A message is displayed to show which variables were used for matching. It’s good to check that those are what you expect.
Exercise
Use the joined data below to make a plot showing the distribution of TB rates.
|>
who setNames(sub("newrel", "new_rel", names(who))) |>
pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
separate(sexage, c("sex", "age_group"), sep = 1) |>
left_join(population2) |>
group_by(country, year) |>
summarise(
total_cases = sum(new_cases, na.rm = TRUE),
tb_rate = total_cases / max(population)
|>
) filter(year == 2010) |>
gf_dens(~ tb_rate)
Exercise
Use the joined data below to locate the ten countries with the highest average TB rate over the years for which they reported data.
Plot the TB rates over time for these ten countries.
|>
who setNames(sub("newrel", "new_rel", names(who))) |>
pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
separate(sexage, c("sex", "age_group"), sep = 1) |>
left_join(population2) |>
group_by(country, iso3, year) |>
summarise(
total_cases = sum(new_cases, na.rm = TRUE),
tb_rate = total_cases / max(population)
|>
) group_by(country, iso3) |>
mutate(mean_tb_rate = mean(tb_rate, na.rm = TRUE)) |>
ungroup() |>
arrange(-mean_tb_rate) |>
head(10 * (2013 - 1979)) |>
gf_line(tb_rate ~ year, color = ~ iso3) |>
gf_lims(x = c(1990, NA))
A few other things
We didn’t need them in this case study, but here are some additional functions that can be useful for tidying up your data.
Fancy table output
There are a number of packages that provide fancier output of data tables. Some only work in certain document formats. Some are highly customizable.
Here is an example of a customized table output with {gt}
. We have changed the number formatting for population figures and added a number of interactive features. See the documentation for details and additional features.
Extracting values from text
readr::parse_number()
, readr::parse_integer()
, readr::parse_date()
, etc.: These functions read through the values in a variable and do there best to extract the type of information requested. The R ingestion functions will typically turn what you expect to be numeric values into text if there are any values that it cannot interpret as a number. So this can be very handy when bringing in data created in other software.
Exercises
Enter some other types of text to see how
parse_number()
converts things.Try changing
parse_number()
to one of the other parsers.
Add
na = c("", "-")
to theparse_number()
call above. What does that do?Dates are tricky because there are so many formats, and a date can be ambiguous without knowledge of the format used. (What date is 10-11-12?)
lubridate()
provides functions likemdy()
,ymd()
, etc. that allow us to specify the order of year, month, and day. (Addhms
if you also have times to process.) Editsome_dates
below to see what formats are correctly parsed.
The lubridate
package has many other utilities for working with dates and times.