Demographic tables are used to summarise the data that is being used in a study and quickly show differences between subgroups.

1 Example Data

For this example, use the “Gapminder” data set. This contains a whole bunch of economic data like GDP, population and life expectancy from countries around the world. Conveniently, an R package has been written to interface with the data, and this can be installed and imported in the same way as any other R package:

library(gapminder)

Note that the data is in tibble format by default, so convert it into a data frame to make things less complicated:

gapminder <- as.data.frame(gapminder)

Alternatively, you can import the data directly from my website (which will save you from installing the package yourself):

data_url <- "https://raw.githubusercontent.com/rowannicholls/rowannicholls.github.io/master/gapminder.csv"
gapminder <- read.csv(url(data_url))

As this data set is quite large (it has 1,704 rows) filter it down a bit to make it easier to work with. You can do this by only extracting the data from 2002 and 2007:

gapminder <- subset(gapminder, year == 2002 | year == 2007)

Let’s take a look at the first 15 rows:

library(kableExtra)

kable_input <- kable(head(gapminder, 15))
kable_styling(kable_input, bootstrap_options = c("striped", "hover", "condensed"))

	country	continent	year	lifeExp	pop	gdpPercap
11	Afghanistan	Asia	2002	42.129	25268405	726.7341
12	Afghanistan	Asia	2007	43.828	31889923	974.5803
23	Albania	Europe	2002	75.651	3508512	4604.2117
24	Albania	Europe	2007	76.423	3600523	5937.0295
35	Algeria	Africa	2002	70.994	31287142	5288.0404
36	Algeria	Africa	2007	72.301	33333216	6223.3675
47	Angola	Africa	2002	41.003	10866106	2773.2873
48	Angola	Africa	2007	42.731	12420476	4797.2313
59	Argentina	Americas	2002	74.340	38331121	8797.6407
60	Argentina	Americas	2007	75.320	40301927	12779.3796
71	Australia	Oceania	2002	80.370	19546792	30687.7547
72	Australia	Oceania	2007	81.235	20434176	34435.3674
83	Austria	Europe	2002	78.980	8148312	32417.6077
84	Austria	Europe	2007	79.829	8199783	36126.4927
95	Bahrain	Asia	2002	74.795	656397	23403.5593

As you can see, we have life expectancy, population and per-capita GDP data for a whole bunch of countries for multiple timepoints. For more information about the Gapminder data set itself see here and for more information about the Gapminder R package, see its CRAN readme and documentation, its R Documentation page or its GitHub page.

2 The Data’s Format

Currently this data is in what’s known as wide format. This is when each variable has its own column and each row represents a ‘entry’ into the data set. The other standard format is long (or narrow) format, which is when all variables are in one column and all values are in another, and each row represents one measurement or reading. You can convert from wide to long format using the melt() function:

# The melt() function from the reshape package converts the format from wide to long
library(reshape)
df <- melt(gapminder, id.vars = c("country", "continent", "year"))
print(head(df, 10))

##        country continent year variable  value
## 1  Afghanistan      Asia 2002  lifeExp 42.129
## 2  Afghanistan      Asia 2007  lifeExp 43.828
## 3      Albania    Europe 2002  lifeExp 75.651
## 4      Albania    Europe 2007  lifeExp 76.423
## 5      Algeria    Africa 2002  lifeExp 70.994
## 6      Algeria    Africa 2007  lifeExp 72.301
## 7       Angola    Africa 2002  lifeExp 41.003
## 8       Angola    Africa 2007  lifeExp 42.731
## 9    Argentina  Americas 2002  lifeExp 74.340
## 10   Argentina  Americas 2007  lifeExp 75.320

…and you can convert back to wide format using the cast() function:

# The cast() function from the reshape package converts a table from long to wide
df <- cast(df, country + continent + year ~ variable, value = "value")
print(head(df, 10))

##        country continent year lifeExp      pop  gdpPercap
## 1  Afghanistan      Asia 2002  42.129 25268405   726.7341
## 2  Afghanistan      Asia 2007  43.828 31889923   974.5803
## 3      Albania    Europe 2002  75.651  3508512  4604.2117
## 4      Albania    Europe 2007  76.423  3600523  5937.0295
## 5      Algeria    Africa 2002  70.994 31287142  5288.0404
## 6      Algeria    Africa 2007  72.301 33333216  6223.3675
## 7       Angola    Africa 2002  41.003 10866106  2773.2873
## 8       Angola    Africa 2007  42.731 12420476  4797.2313
## 9    Argentina  Americas 2002  74.340 38331121  8797.6407
## 10   Argentina  Americas 2007  75.320 40301927 12779.3796

If we use the cast() function on this data frame again we get a pivot table which is where each value of a particular variable has its own column. For example, we can have each year be its own column wherein the population of each country in that year is shown:

# The cast() function from the reshape package pivots a table that is in wide format
library(reshape)
df <- cast(gapminder, country ~ year, value = "pop")
print(head(df, 10))

##        country      2002      2007
## 1  Afghanistan  25268405  31889923
## 2      Albania   3508512   3600523
## 3      Algeria  31287142  33333216
## 4       Angola  10866106  12420476
## 5    Argentina  38331121  40301927
## 6    Australia  19546792  20434176
## 7      Austria   8148312   8199783
## 8      Bahrain    656397    708573
## 9   Bangladesh 135656790 150448339
## 10     Belgium  10311970  10392226

Notice that the above data frame contains only the countries, years and populations. We have lost the continent, life expectancy and GDP data when changing to this format!

In summary:

melt() converts:
- A pivot table to wide format
- Wide format to long format
cast() does the opposite as it converts:
- Long format to wide format
- Wide format into a pivot table

In general, and for this tutorial in particular, long format is best! Additionally, always make sure that the columns correspond to the variables as opposed to the rows. Functions don’t expect data to be in a transposed format:

transposed <- t(gapminder)
print(transposed[, c(1:5)])

##           11            12            23           24           35          
## country   "Afghanistan" "Afghanistan" "Albania"    "Albania"    "Algeria"   
## continent "Asia"        "Asia"        "Europe"     "Europe"     "Africa"    
## year      "2002"        "2007"        "2002"       "2007"       "2002"      
## lifeExp   "42.129"      "43.828"      "75.651"     "76.423"     "70.994"    
## pop       "  25268405"  "  31889923"  "   3508512" "   3600523" "  31287142"
## gdpPercap "  726.7341"  "  974.5803"  " 4604.2117" " 5937.0295" " 5288.0404"

In other words, all the data in each column should always be of the same type (all numbers or all strings).

For more info, see the page on pivot tables and the Wikipedia page on wide vs narrow data.

##       country continent year variable  value
## 1 Afghanistan      Asia 2002  lifeExp 42.129
## 2 Afghanistan      Asia 2007  lifeExp 43.828
## 3     Albania    Europe 2002  lifeExp 75.651
## 4     Albania    Europe 2007  lifeExp 76.423
## 5     Algeria    Africa 2002  lifeExp 70.994
## 6     Algeria    Africa 2007  lifeExp 72.301

3 Counts

Now that the data’s in the right format, let’s take a look at how to create demographic table rows by counting values. The dplyr package will be useful for this so let’s import it (and suppress its loading messages while we’re at it):

suppressMessages(library(dplyr))

We will mostly be using the summarise() function which will allow us to summarise the data in a data frame and display it in a newly-constructed data frame more easily than, for example, the data.frame() function. This is because the summarise() function takes in a data frame as an input and then assumes the summarising functions you are running are meant for that, whereas with data.frame() you need to specify what data you are using in every single function.

3.1 Sample size

Get the total n for the data you are working with:

df <- summarise(gapminder, n = n())
print(df)

##     n
## 1 852

The original Gapminder data set has 1,704 rows but we filtered out all but 2 years’ worth of data (reducing the size by five-sixths) then converted to long format (which increased the length by a factor of three because there are three variables). Thus we expect to be left with 852 rows because \(1704 \div 6 \times 3 = 852\).

3.2 Groups

If we consider each continent to be a group of countries’ data and implement this by using the group_by() function, we can get the sample size for each:

df <- group_by(gapminder, continent)
groups <- summarise(df, n = n())
print(groups)

## # A tibble: 5 × 2
##   continent     n
##   <chr>     <int>
## 1 Africa      312
## 2 Americas    150
## 3 Asia        198
## 4 Europe      180
## 5 Oceania      12

Note that this counts each country twice because we have data from each country for two years. So the number of countries in each continent is actually half the numbers shown above.

Expressing these as percentages by dividing by the total number of rows:

groups <- summarise(df, n = n() / nrow(gapminder) * 100)
print(as.data.frame(groups))

##   continent         n
## 1    Africa 36.619718
## 2  Americas 17.605634
## 3      Asia 23.239437
## 4    Europe 21.126761
## 5   Oceania  1.408451

Changing the number formatting using the round() function:

groups <- summarise(df, n = round(n() / nrow(gapminder) * 100, 1))
print(as.data.frame(groups))

##   continent    n
## 1    Africa 36.6
## 2  Americas 17.6
## 3      Asia 23.2
## 4    Europe 21.1
## 5   Oceania  1.4

And use string formatting with the sprintf() function:

groups <- summarise(df, n = sprintf("%d (%.1f%%)", n(), n() / nrow(gapminder) * 100, 2))

## Warning: There were 5 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `n = sprintf("%d (%.1f%%)", n(), n()/nrow(gapminder) * 100, 2)`.
## ℹ In group 1: `continent = "Africa"`.
## Caused by warning in `sprintf()`:
## ! one argument not used by format '%d (%.1f%%)'
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 4 remaining warnings.

print(as.data.frame(groups))

##   continent           n
## 1    Africa 312 (36.6%)
## 2  Americas 150 (17.6%)
## 3      Asia 198 (23.2%)
## 4    Europe 180 (21.1%)
## 5   Oceania   12 (1.4%)

3.3 Missing Data

A similar process can be used to count missing data in the data set. The Gapminder data set doesn’t actually have any missing data, so let’s delete some for the sake of this example:

gapminder[c(50, 51, 52), "value"] <- NA
gapminder[c(351, 352, 353), "value"] <- NA
gapminder[c(698, 699, 700, 701), "value"] <- NA

Now we can use the same functions as above:

df <- group_by(gapminder, variable)
missing_data <- summarise(df, `Missing Values` = sum(is.na(value)))

As percentages:

missing_data <- summarise(df, `Missing Values` = sum(is.na(value)) / n() * 100)

Number formatting

missing_data <- summarise(df, `Missing Values` = signif(sum(is.na(value)) / n() * 100, 2))

String formatting:

missing_data <- summarise(df, `Missing Values` = sprintf("%d (%.1f%%)", sum(is.na(value)), sum(is.na(value)) / n() * 100))
print(as.data.frame(missing_data))

##    variable Missing Values
## 1   lifeExp       3 (1.1%)
## 2       pop       3 (1.1%)
## 3 gdpPercap       4 (1.4%)

4 Summary Statistics

If your data seems to be symmetrically-distributed, you might report its means and standard deviations. Use the na.rm = TRUE flag to remove missing data before performing the calculation (if you don’t do this and you have missing data, your results will be NA):

summary_statistics <- summarise(df, `Summary Statistics` = round(mean(value, na.rm = T)))
print(as.data.frame(summary_statistics))

##    variable Summary Statistics
## 1   lifeExp                 66
## 2       pop           43103669
## 3 gdpPercap              10698

summary_statistics <- summarise(
    df,
    `Summary Statistics` = sprintf("%.0f (%.0f)", round(mean(value, na.rm = T)), round(sd(value, na.rm = T)))
)
print(as.data.frame(summary_statistics))

##    variable   Summary Statistics
## 1   lifeExp              66 (12)
## 2       pop 43103669 (144750965)
## 3 gdpPercap        10698 (12036)

If your data is not symmetrically-distributed, you might report its medians and inter-quartile ranges:

summary_statistics <- summarise(df, `Summary Statistics` = round(median(value, na.rm = T)))
print(as.data.frame(summary_statistics))

##    variable Summary Statistics
## 1   lifeExp                 71
## 2       pop           10580176
## 3 gdpPercap               5652

summary_statistics <- summarise(
    df,
    `Summary Statistics` = sprintf("%.0f (%.0f)", round(median(value, na.rm = T)), round(IQR(value, na.rm = T)))
)
print(as.data.frame(summary_statistics))

##    variable  Summary Statistics
## 1   lifeExp             71 (19)
## 2       pop 10580176 (26970007)
## 3 gdpPercap        5652 (13417)

5 Multiple Columns

An alternative way of presenting the demographic data is to have a separate column for each thing you want to show as opposed to displaying it all in one column. Here’s the means and standard deviations in their own columns:

summary_total <- summarise(
    df,
    `Mean` = round(mean(value, na.rm = T)),
    `Std Dev` = round(sd(value, na.rm = T))
)
print(as.data.frame(summary_total))

##    variable     Mean   Std Dev
## 1   lifeExp       66        12
## 2       pop 43103669 144750965
## 3 gdpPercap    10698     12036

This can lead to a complication if we want to display multiple groups in a table side-by-side. As an example, here’s a demographics table with the subset of the that comes from Europe next to the subset that comes from Asia:

europe <- subset(df, continent == "Europe")
summary_europe <- summarise(europe, `Mean` = round(mean(value, na.rm = T)), `Std Dev` = round(sd(value, na.rm = T)))

asia <- subset(df, continent == "Asia")
summary_asia <- summarise(asia, `Mean` = round(mean(value, na.rm = T)), `Std Dev` = round(sd(value, na.rm = T)))

output <- merge(summary_europe, summary_asia, by = "variable")
print(output)

##    variable   Mean.x Std Dev.x    Mean.y Std Dev.y
## 1 gdpPercap    23295     11607     11058     12609
## 2   lifeExp       77         3        70         8
## 3       pop 19973029  23696169 112329637 281092730

How can we include headers to show that columns two and three are “Europe” whilst four and five are “Asia”? It’s not really feasible to do this in the data frame itself but we can instead change how it’s rendered by using the add_header_above() function together with the kable() function:

library(kableExtra)

kable_input <- kable(head(output))
kable_input <- add_header_above(kable_input, c(" ", "Europe" = 2, "Asia" = 2))
kable_styling(kable_input, bootstrap_options = c("striped", "hover", "condensed"), full_width = F)

	Europe		Asia
variable	Mean.x	Std Dev.x	Mean.y	Std Dev.y
gdpPercap	23295	11607	11058	12609
lifeExp	77	3	70	8
pop	19973029	23696169	112329637	281092730

⇦ Back

Data Handling in R:Demographic Data Tables