Tidying up a data set: A Case Study with TB data

Author

Randall Pruim

Published

October 26, 2024

Tidy Data

Structure of tidy data

rows (cases/observational units) and
columns (variables).
The key is that every row is a case and every column is a variable.
No exceptions.

But what should the rows be?

That depends on the purpose.

For plotting, we often want “glyph ready data”
- one row per glyph that gets drawn
- columns contain variables that get mapped to attributes
The answer may be different at different points in an analysis (or for different layers in the same plot)

You should always know what the rows represent.

Case Study: Tidying up WHO data

The Data

The who data set in the tidyr package contains data from the World Health Organization Global Tuberculosis Report for 1995-2013. You can find out more about this data at http://www.who.int/tb/country/data/download/en/

The population data set contains global populations from 1980 to 2013.

An Inconvenient Form: Data in variables

For many purposes, the WHO data are in an inconvenient format. Each variable new_sp_m014 through newrel_f65 counts the number of new TB cases of a certain kind recorded in a given country, in a given year. The documentation says the variable names are constructed as follows:

They all begin new
The next bit is a code for the method of diagnosis
- rel = relapse, sn = negative pulmonary smear, sp = positive pulmonary smear, ep = extrapulmonary
- Note: The separating _ is missing for rel but present for the others.
After another _ there is a m or an f for male or female
This is followed directly by an age code
- 014 = 0-14 yrs of age, 1524 = 15-24 years of age, 2534 = 25 to 34 years of age, 3544 = 35 to 44 years of age, 4554 = 45 to 54 years of age, 5564 = 55 to 64 years of age, 65 = 65 years of age or older

This format is sometimes called data-in-variables format, since some of the information that we might think of as data values is stored in the names of the variables.

Discussion

For what purposes might this be tidy data?

Exercises

How many countries are in each data set (who and population)?

What countries are listed in one data set but not in the other?

Solution 1

There are a number of ways to do this. This approach gives us some additional useful information.

bind_cols(
  who |> 
    group_by(country) |>
    summarise(n = n()) |> 
    mutate(id = n():1),
  population |> 
    group_by(country) |>
    summarise(n = n()) |> 
    mutate(id = n():1)
) |>
  filter(country != country1 | id != id1)

Solution 2

This solution uses

anti_join() to find the rows that cannot be matched by country
pull() to grab only the country variable
unique() to list each country only once

who |> anti_join(population, by = "country") |> pull(country) |> unique()
population |> anti_join(who, by = "country") |> pull(country) |> unique()

Reshaping the data.

Goal:

row = a person group in a given year and country
person group described by variables country, iso2, iso3, year, sex, diagnosis, age_group
additional variable: count = number of cases

This format is good for many purposes (including subsequent reshaping and tidying).

Fixing inconsistent variable names

Let start by making the naming scheme a bit more uniform. We will replace newrel with new_rel.

setNames() sets the names on a new object

Note that setNames() returns a new object, with the names potentially modified from the original object’s names. It does not modify the object in place. So if you want the new names to persist, remember to save the result of using setNames().

Alternative method using stringr

Note: We could use stringr::str_replace() instead of sub()

pivot_longer()

pivot_longer() is used to convert the names and values of multiple columns into multiple two-column rows. In each pair, one column will contain the original names (names_to) and the other will contain the original values (values_to).

Here is a demonstration using just the 2000 observations from the USA.

Notice how columns 5 - 60 (the ones that include “new” in their names) turned into two columns (code and new_cases). Columns 1 - 4, which were not included in our list of columns, were duplicated.

If we omit the filter() command, we can process the entire data set.

Note: The inverse operation to pivot_loger() is called pivot_wider(). It takes a pair of columns and turns one column into variable names (names_from) and the other into values (values_from).

separate()

Now we need to separate our code column into separate columns, one for each piece of information coded.

And subsequently we can separate the sex from the age group using position rather than a character to do the separation.

Note: The inverse operation of separate() is unite(). (paste() is often useful as well.)

Exercises

Add on some additional code to check that things are working.

Show that the only values of sex are m and f and that they appear in equal numbers.
Do a similar thing for the diagnosis groups and age groups

Solution

# here is the solution for checking that we have equal numbers of males and females.
# you can 
who |> 
  setNames(sub("newrel", "new_rel", names(who))) |> 
  pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
  separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
  separate(sexage, c("sex", "age_group"), sep = 1) |>
  group_by(sex) |>
  summarise(n = n())

Construct a plot that shows the total number of new TB cases over time for each country.

That’s a lot of countries. Perhaps you might like to remake the plot using only a subset of countries. Which countries?

Solution

# Note:  This will be more useful once we can scale by total population
who |> 
  setNames(sub("newrel", "new_rel", names(who))) |> 
  pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
  separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
  separate(sexage, c("sex", "age_group"), sep = 1) |>
  group_by(country, iso3, year) |>
  summarise(total_cases = sum(new_cases, na.rm = TRUE)) |>
  filter(iso3 %in% c("USA", "IND", "CHN", "NGA", "DEU", "GHA")) |> 
  gf_line(total_cases ~ year, color = ~country)

Joins

Comparing TB cases to population

More interesting than the number of TB cases is the fraction of the population that is contracting TB.

who data does not include population.
the population data has population information, but nothing about TB

We need to join these data sets together so we can compute the TB rate for each country (in each year).

Both data sets have a country variable that we can use to match things up between data sets.

We will consider our modified who data to be primary data and add in population information from population.
Unfortunately, two country names are spelled differently in who and population, so we will need to address this before joining.

The join functions

dplyr has a number of functions with “join” in the name that are used to bring data from two sources together into one. The basic syntax is

LeftData |> 
  some_join(RightData, ...)

where some_join() is one of

inner_join(): return all rows from LeftData where there are matching values in Data2, and all columns from LeftData and RightData. If there are multiple matches between LeftData and RightData, all combination of the matches are returned.
left_join(): return all rows from LeftData, and all columns from LeftData and RightData. Rows in LeftData with no match in RightData will have NA values in the new columns. If there are multiple matches between LeftData and RightData, all combinations of the matches are returned.
right_join(): return all rows from RightData, and all columns from LeftData and RightData. Rows in RightData with no match in LeftData will have NA values in the new columns. If there are multiple matches between LeftData and RightData, all combinations of the matches are returned.
semi_join(): return all rows from LeftData where there are matching values in RightData, keeping just columns from LeftData. A semi join differs from an inner join because an inner join will return one row of LeftData for each matching row of RightData, where a semi join will never duplicate rows of LeftData.
anti_join(): return all rows from LeftData where there are not matching values in RightData, keeping just columns from LeftData.

full_join(): return all rows and all columns from both LeftData and RightData. Where there are not matching values, returns NA for the one missing.

Using anti_join() to detect joining problems

We can recode the countries in population to match the names in who using tidyr::case_when().

Let’s save this with a new name so we can reuse it below.

We want a left_join() of our modified who data with the population data. By default, matching is on all columns that have the same names. That works just right for us in this case. When the columns for matching have different names, there are ways to specify which names in LeftData correspond with which names in RightData (or we can rename the variables to make them match).

A message is displayed to show which variables were used for matching. It’s good to check that those are what you expect.

Exercise

Use the joined data below to make a plot showing the distribution of TB rates.

Solution

who |> 
  setNames(sub("newrel", "new_rel", names(who))) |> 
  pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
  separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
  separate(sexage, c("sex", "age_group"), sep = 1) |>
  left_join(population2) |> 
  group_by(country, year) |>
  summarise(
    total_cases = sum(new_cases, na.rm = TRUE),
    tb_rate = total_cases / max(population)
  ) |> 
  filter(year == 2010) |>
  gf_dens(~ tb_rate)

Exercise

Use the joined data below to locate the ten countries with the highest average TB rate over the years for which they reported data.
Plot the TB rates over time for these ten countries.

Solution

who |> 
  setNames(sub("newrel", "new_rel", names(who))) |> 
  pivot_longer(names_to = "code", values_to = "new_cases", matches("new")) |>
  separate(code, c("status", "diagnosis", "sexage"), sep = "_") |>
  separate(sexage, c("sex", "age_group"), sep = 1) |>
  left_join(population2) |> 
  group_by(country, iso3, year) |>
  summarise(
    total_cases = sum(new_cases, na.rm = TRUE),
    tb_rate = total_cases / max(population)
  ) |> 
  group_by(country, iso3) |>
  mutate(mean_tb_rate = mean(tb_rate, na.rm = TRUE)) |>
  ungroup() |>
  arrange(-mean_tb_rate) |>
  head(10 * (2013 - 1979)) |>
  gf_line(tb_rate ~ year, color = ~ iso3) |> 
  gf_lims(x = c(1990, NA))

A few other things

We didn’t need them in this case study, but here are some additional functions that can be useful for tidying up your data.

Fancy table output

There are a number of packages that provide fancier output of data tables. Some only work in certain document formats. Some are highly customizable.

Here is an example of a customized table output with {gt}. We have changed the number formatting for population figures and added a number of interactive features. See the documentation for details and additional features.

Extracting values from text

readr::parse_number(), readr::parse_integer(), readr::parse_date(), etc.: These functions read through the values in a variable and do there best to extract the type of information requested. The R ingestion functions will typically turn what you expect to be numeric values into text if there are any values that it cannot interpret as a number. So this can be very handy when bringing in data created in other software.

Exercises

Enter some other types of text to see how parse_number() converts things.
Try changing parse_number() to one of the other parsers.

Add na = c("", "-") to the parse_number() call above. What does that do?
Dates are tricky because there are so many formats, and a date can be ambiguous without knowledge of the format used. (What date is 10-11-12?) lubridate() provides functions like mdy(), ymd(), etc. that allow us to specify the order of year, month, and day. (Add hms if you also have times to process.) Edit some_dates below to see what formats are correctly parsed.

The lubridate package has many other utilities for working with dates and times.