Demographic tables are used to summarise the data that is being used in a study and quickly show differences between subgroups.
For this example, use the “Gapminder” data set. This contains a whole bunch of economic data like GDP, population and life expectancy from countries around the world. Conveniently, an R package has been written to interface with the data, and this can be installed and imported in the same way as any other R package:
library(gapminder)
Note that the data is in tibble format by default, so convert it into a data frame to make things less complicated:
gapminder <- as.data.frame(gapminder)
Alternatively, you can import the data directly from my website (which will save you from installing the package yourself):
data_url <- "https://raw.githubusercontent.com/rowannicholls/rowannicholls.github.io/master/gapminder.csv"
gapminder <- read.csv(url(data_url))
As this data set is quite large (it has 1,704 rows) filter it down a bit to make it easier to work with. You can do this by only extracting the data from 2002 and 2007:
gapminder <- subset(gapminder, year == 2002 | year == 2007)
Let’s take a look at the first 15 rows:
library(kableExtra)
kable_input <- kable(head(gapminder, 15))
kable_styling(kable_input, bootstrap_options = c("striped", "hover", "condensed"))
country | continent | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|
11 | Afghanistan | Asia | 2002 | 42.129 | 25268405 | 726.7341 |
12 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.5803 |
23 | Albania | Europe | 2002 | 75.651 | 3508512 | 4604.2117 |
24 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.0295 |
35 | Algeria | Africa | 2002 | 70.994 | 31287142 | 5288.0404 |
36 | Algeria | Africa | 2007 | 72.301 | 33333216 | 6223.3675 |
47 | Angola | Africa | 2002 | 41.003 | 10866106 | 2773.2873 |
48 | Angola | Africa | 2007 | 42.731 | 12420476 | 4797.2313 |
59 | Argentina | Americas | 2002 | 74.340 | 38331121 | 8797.6407 |
60 | Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.3796 |
71 | Australia | Oceania | 2002 | 80.370 | 19546792 | 30687.7547 |
72 | Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.3674 |
83 | Austria | Europe | 2002 | 78.980 | 8148312 | 32417.6077 |
84 | Austria | Europe | 2007 | 79.829 | 8199783 | 36126.4927 |
95 | Bahrain | Asia | 2002 | 74.795 | 656397 | 23403.5593 |
As you can see, we have life expectancy, population and per-capita GDP data for a whole bunch of countries for multiple timepoints. For more information about the Gapminder data set itself see here and for more information about the Gapminder R package, see its CRAN readme and documentation, its R Documentation page or its GitHub page.
Currently this data is in what’s known as wide format. This is when each variable has its own column and each row represents a ‘entry’ into the data set. The other standard format is long (or narrow) format, which is when all variables are in one column and all values are in another, and each row represents one measurement or reading. You can convert from wide to long format using the melt()
function:
# The melt() function from the reshape package converts the format from wide to long
library(reshape)
df <- melt(gapminder, id.vars = c("country", "continent", "year"))
print(head(df, 10))
## country continent year variable value
## 1 Afghanistan Asia 2002 lifeExp 42.129
## 2 Afghanistan Asia 2007 lifeExp 43.828
## 3 Albania Europe 2002 lifeExp 75.651
## 4 Albania Europe 2007 lifeExp 76.423
## 5 Algeria Africa 2002 lifeExp 70.994
## 6 Algeria Africa 2007 lifeExp 72.301
## 7 Angola Africa 2002 lifeExp 41.003
## 8 Angola Africa 2007 lifeExp 42.731
## 9 Argentina Americas 2002 lifeExp 74.340
## 10 Argentina Americas 2007 lifeExp 75.320
…and you can convert back to wide format using the cast()
function:
# The cast() function from the reshape package converts a table from long to wide
df <- cast(df, country + continent + year ~ variable, value = "value")
print(head(df, 10))
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 2002 42.129 25268405 726.7341
## 2 Afghanistan Asia 2007 43.828 31889923 974.5803
## 3 Albania Europe 2002 75.651 3508512 4604.2117
## 4 Albania Europe 2007 76.423 3600523 5937.0295
## 5 Algeria Africa 2002 70.994 31287142 5288.0404
## 6 Algeria Africa 2007 72.301 33333216 6223.3675
## 7 Angola Africa 2002 41.003 10866106 2773.2873
## 8 Angola Africa 2007 42.731 12420476 4797.2313
## 9 Argentina Americas 2002 74.340 38331121 8797.6407
## 10 Argentina Americas 2007 75.320 40301927 12779.3796
If we use the cast()
function on this data frame again we get a pivot table which is where each value of a particular variable has its own column. For example, we can have each year be its own column wherein the population of each country in that year is shown:
# The cast() function from the reshape package pivots a table that is in wide format
library(reshape)
df <- cast(gapminder, country ~ year, value = "pop")
print(head(df, 10))
## country 2002 2007
## 1 Afghanistan 25268405 31889923
## 2 Albania 3508512 3600523
## 3 Algeria 31287142 33333216
## 4 Angola 10866106 12420476
## 5 Argentina 38331121 40301927
## 6 Australia 19546792 20434176
## 7 Austria 8148312 8199783
## 8 Bahrain 656397 708573
## 9 Bangladesh 135656790 150448339
## 10 Belgium 10311970 10392226
Notice that the above data frame contains only the countries, years and populations. We have lost the continent, life expectancy and GDP data when changing to this format!
In summary:
melt()
converts:
cast()
does the opposite as it converts:
In general, and for this tutorial in particular, long format is best! Additionally, always make sure that the columns correspond to the variables as opposed to the rows. Functions don’t expect data to be in a transposed format:
transposed <- t(gapminder)
print(transposed[, c(1:5)])
## 11 12 23 24 35
## country "Afghanistan" "Afghanistan" "Albania" "Albania" "Algeria"
## continent "Asia" "Asia" "Europe" "Europe" "Africa"
## year "2002" "2007" "2002" "2007" "2002"
## lifeExp "42.129" "43.828" "75.651" "76.423" "70.994"
## pop " 25268405" " 31889923" " 3508512" " 3600523" " 31287142"
## gdpPercap " 726.7341" " 974.5803" " 4604.2117" " 5937.0295" " 5288.0404"
In other words, all the data in each column should always be of the same type (all numbers or all strings).
For more info, see the page on pivot tables and the Wikipedia page on wide vs narrow data.
## country continent year variable value
## 1 Afghanistan Asia 2002 lifeExp 42.129
## 2 Afghanistan Asia 2007 lifeExp 43.828
## 3 Albania Europe 2002 lifeExp 75.651
## 4 Albania Europe 2007 lifeExp 76.423
## 5 Algeria Africa 2002 lifeExp 70.994
## 6 Algeria Africa 2007 lifeExp 72.301
Now that the data’s in the right format, let’s take a look at how to create demographic table rows by counting values. The dplyr package will be useful for this so let’s import it (and suppress its loading messages while we’re at it):
suppressMessages(library(dplyr))
We will mostly be using the summarise()
function which will allow us to summarise the data in a data frame and display it in a newly-constructed data frame more easily than, for example, the data.frame()
function. This is because the summarise()
function takes in a data frame as an input and then assumes the summarising functions you are running are meant for that, whereas with data.frame()
you need to specify what data you are using in every single function.
Get the total n for the data you are working with:
df <- summarise(gapminder, n = n())
print(df)
## n
## 1 852
The original Gapminder data set has 1,704 rows but we filtered out all but 2 years’ worth of data (reducing the size by five-sixths) then converted to long format (which increased the length by a factor of three because there are three variables). Thus we expect to be left with 852 rows because \(1704 \div 6 \times 3 = 852\).
If we consider each continent to be a group of countries’ data and implement this by using the group_by()
function, we can get the sample size for each:
df <- group_by(gapminder, continent)
groups <- summarise(df, n = n())
print(groups)
## # A tibble: 5 × 2
## continent n
## <chr> <int>
## 1 Africa 312
## 2 Americas 150
## 3 Asia 198
## 4 Europe 180
## 5 Oceania 12
Note that this counts each country twice because we have data from each country for two years. So the number of countries in each continent is actually half the numbers shown above.
Expressing these as percentages by dividing by the total number of rows:
groups <- summarise(df, n = n() / nrow(gapminder) * 100)
print(as.data.frame(groups))
## continent n
## 1 Africa 36.619718
## 2 Americas 17.605634
## 3 Asia 23.239437
## 4 Europe 21.126761
## 5 Oceania 1.408451
Changing the number formatting using the round()
function:
groups <- summarise(df, n = round(n() / nrow(gapminder) * 100, 1))
print(as.data.frame(groups))
## continent n
## 1 Africa 36.6
## 2 Americas 17.6
## 3 Asia 23.2
## 4 Europe 21.1
## 5 Oceania 1.4
And use string formatting with the sprintf()
function:
groups <- summarise(df, n = sprintf("%d (%.1f%%)", n(), n() / nrow(gapminder) * 100, 2))
## Warning: There were 5 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `n = sprintf("%d (%.1f%%)", n(), n()/nrow(gapminder) * 100, 2)`.
## ℹ In group 1: `continent = "Africa"`.
## Caused by warning in `sprintf()`:
## ! one argument not used by format '%d (%.1f%%)'
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 4 remaining warnings.
print(as.data.frame(groups))
## continent n
## 1 Africa 312 (36.6%)
## 2 Americas 150 (17.6%)
## 3 Asia 198 (23.2%)
## 4 Europe 180 (21.1%)
## 5 Oceania 12 (1.4%)
A similar process can be used to count missing data in the data set. The Gapminder data set doesn’t actually have any missing data, so let’s delete some for the sake of this example:
gapminder[c(50, 51, 52), "value"] <- NA
gapminder[c(351, 352, 353), "value"] <- NA
gapminder[c(698, 699, 700, 701), "value"] <- NA
Now we can use the same functions as above:
df <- group_by(gapminder, variable)
missing_data <- summarise(df, `Missing Values` = sum(is.na(value)))
As percentages:
missing_data <- summarise(df, `Missing Values` = sum(is.na(value)) / n() * 100)
Number formatting
missing_data <- summarise(df, `Missing Values` = signif(sum(is.na(value)) / n() * 100, 2))
String formatting:
missing_data <- summarise(df, `Missing Values` = sprintf("%d (%.1f%%)", sum(is.na(value)), sum(is.na(value)) / n() * 100))
print(as.data.frame(missing_data))
## variable Missing Values
## 1 lifeExp 3 (1.1%)
## 2 pop 3 (1.1%)
## 3 gdpPercap 4 (1.4%)
If your data seems to be symmetrically-distributed, you might report its means and standard deviations. Use the na.rm = TRUE
flag to remove missing data before performing the calculation (if you don’t do this and you have missing data, your results will be NA
):
summary_statistics <- summarise(df, `Summary Statistics` = round(mean(value, na.rm = T)))
print(as.data.frame(summary_statistics))
## variable Summary Statistics
## 1 lifeExp 66
## 2 pop 43103669
## 3 gdpPercap 10698
summary_statistics <- summarise(
df,
`Summary Statistics` = sprintf("%.0f (%.0f)", round(mean(value, na.rm = T)), round(sd(value, na.rm = T)))
)
print(as.data.frame(summary_statistics))
## variable Summary Statistics
## 1 lifeExp 66 (12)
## 2 pop 43103669 (144750965)
## 3 gdpPercap 10698 (12036)
If your data is not symmetrically-distributed, you might report its medians and inter-quartile ranges:
summary_statistics <- summarise(df, `Summary Statistics` = round(median(value, na.rm = T)))
print(as.data.frame(summary_statistics))
## variable Summary Statistics
## 1 lifeExp 71
## 2 pop 10580176
## 3 gdpPercap 5652
summary_statistics <- summarise(
df,
`Summary Statistics` = sprintf("%.0f (%.0f)", round(median(value, na.rm = T)), round(IQR(value, na.rm = T)))
)
print(as.data.frame(summary_statistics))
## variable Summary Statistics
## 1 lifeExp 71 (19)
## 2 pop 10580176 (26970007)
## 3 gdpPercap 5652 (13417)
An alternative way of presenting the demographic data is to have a separate column for each thing you want to show as opposed to displaying it all in one column. Here’s the means and standard deviations in their own columns:
summary_total <- summarise(
df,
`Mean` = round(mean(value, na.rm = T)),
`Std Dev` = round(sd(value, na.rm = T))
)
print(as.data.frame(summary_total))
## variable Mean Std Dev
## 1 lifeExp 66 12
## 2 pop 43103669 144750965
## 3 gdpPercap 10698 12036
This can lead to a complication if we want to display multiple groups in a table side-by-side. As an example, here’s a demographics table with the subset of the that comes from Europe next to the subset that comes from Asia:
europe <- subset(df, continent == "Europe")
summary_europe <- summarise(europe, `Mean` = round(mean(value, na.rm = T)), `Std Dev` = round(sd(value, na.rm = T)))
asia <- subset(df, continent == "Asia")
summary_asia <- summarise(asia, `Mean` = round(mean(value, na.rm = T)), `Std Dev` = round(sd(value, na.rm = T)))
output <- merge(summary_europe, summary_asia, by = "variable")
print(output)
## variable Mean.x Std Dev.x Mean.y Std Dev.y
## 1 gdpPercap 23295 11607 11058 12609
## 2 lifeExp 77 3 70 8
## 3 pop 19973029 23696169 112329637 281092730
How can we include headers to show that columns two and three are “Europe” whilst four and five are “Asia”? It’s not really feasible to do this in the data frame itself but we can instead change how it’s rendered by using the add_header_above()
function together with the kable()
function:
library(kableExtra)
kable_input <- kable(head(output))
kable_input <- add_header_above(kable_input, c(" ", "Europe" = 2, "Asia" = 2))
kable_styling(kable_input, bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
variable | Mean.x | Std Dev.x | Mean.y | Std Dev.y |
---|---|---|---|---|
gdpPercap | 23295 | 11607 | 11058 | 12609 |
lifeExp | 77 | 3 | 70 | 8 |
pop | 19973029 | 23696169 | 112329637 | 281092730 |