A “pivot table” is a way of summarising the data in a table. For example, if we have the following results of three tests that were taken by two students:
Student | Test | Score |
---|---|---|
A | Test 1 | 66 |
B | Test 1 | 76 |
A | Test 2 | 80 |
B | Test 2 | 77 |
A | Test 3 | 59 |
B | Test 3 | 61 |
Then a pivot table would look as follows:
Student | Test 1 | Test 2 | Test 3 |
---|---|---|---|
A | 66 | 80 | 59 |
B | 76 | 77 | 61 |
As you can see, the first table was long while the second was wide. That’s another way of talking about this type of operation: going from long to wide or, if you want to go the opposite way, from wide to long.
For this example we will use the “Gapminder” dataset, which can be installed as follows:
install.packages("gapminder", repos = "http://cran.us.r-project.org")
This dataset contains the life expectancy, population and GDP per capita of 142 countries as recorded in various years:
# Import the gapminder as a tibble
library(gapminder)
# Convert tibble to data frame
gapminder <- as.data.frame(gapminder)
# Display
head(gapminder, n = 20)
cat("\n\t\t\t\t\t\t\t...\n\n")
tail(gapminder)
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## 11 Afghanistan Asia 2002 42.129 25268405 726.7341
## 12 Afghanistan Asia 2007 43.828 31889923 974.5803
## 13 Albania Europe 1952 55.230 1282697 1601.0561
## 14 Albania Europe 1957 59.280 1476505 1942.2842
## 15 Albania Europe 1962 64.820 1728137 2312.8890
## 16 Albania Europe 1967 66.220 1984060 2760.1969
## 17 Albania Europe 1972 67.690 2263554 3313.4222
## 18 Albania Europe 1977 68.930 2509048 3533.0039
## 19 Albania Europe 1982 70.420 2780097 3630.8807
## 20 Albania Europe 1987 72.000 3075321 3738.9327
##
## ...
##
## country continent year lifeExp pop gdpPercap
## 1699 Zimbabwe Africa 1982 60.363 7636524 788.8550
## 1700 Zimbabwe Africa 1987 62.351 9216418 706.1573
## 1701 Zimbabwe Africa 1992 60.377 10704340 693.4208
## 1702 Zimbabwe Africa 1997 46.809 11404948 792.4500
## 1703 Zimbabwe Africa 2002 39.989 11926563 672.0386
## 1704 Zimbabwe Africa 2007 43.487 12311143 469.7093
As you can see, this is a long dataset. To pivot it and make it wide, you have two options:
The spread()
function allows us to choose which variable to use as the column headings and which to use as the data in the table:
library(tidyr)
# Select the columns we are interested in
long <- gapminder[c("country", "continent", "year", "pop")]
# Pivot
wide <- spread(long, year, pop)
Show the population in each country in each year sampled:
head(wide[, 1:12])
## country continent 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997
## 1 Afghanistan Asia 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415
## 2 Albania Europe 1282697 1476505 1728137 1984060 2263554 2509048 2780097 3075321 3326498 3428038
## 3 Algeria Africa 9279525 10270856 11000948 12760499 14760787 17152804 20033753 23254956 26298373 29072015
## 4 Angola Africa 4232095 4561361 4826015 5247469 5894858 6162675 7016384 7874230 8735988 9875024
## 5 Argentina Americas 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463
## 6 Australia Oceania 8691212 9712569 10794968 11872264 13177000 14074100 15184200 16257249 17481977 18565243
The cast()
function allows us to do the same thing without needing to first trim down to just the columns we want:
library(reshape)
wide <- cast(gapminder, country ~ year, value = "pop")
head(wide[, 1:12])
## country 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002
## 1 Afghanistan 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 25268405
## 2 Albania 1282697 1476505 1728137 1984060 2263554 2509048 2780097 3075321 3326498 3428038 3508512
## 3 Algeria 9279525 10270856 11000948 12760499 14760787 17152804 20033753 23254956 26298373 29072015 31287142
## 4 Angola 4232095 4561361 4826015 5247469 5894858 6162675 7016384 7874230 8735988 9875024 10866106
## 5 Argentina 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463 38331121
## 6 Australia 8691212 9712569 10794968 11872264 13177000 14074100 15184200 16257249 17481977 18565243 19546792
Of course, we do want to keep the “continent” column there:
wide <- cast(gapminder, country + continent ~ year, value = "pop")
head(wide[, 1:12])
## country continent 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997
## 1 Afghanistan Asia 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415
## 2 Albania Europe 1282697 1476505 1728137 1984060 2263554 2509048 2780097 3075321 3326498 3428038
## 3 Algeria Africa 9279525 10270856 11000948 12760499 14760787 17152804 20033753 23254956 26298373 29072015
## 4 Angola Africa 4232095 4561361 4826015 5247469 5894858 6162675 7016384 7874230 8735988 9875024
## 5 Argentina Americas 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463
## 6 Australia Oceania 8691212 9712569 10794968 11872264 13177000 14074100 15184200 16257249 17481977 18565243
The reshape2 package has the acast()
and the dcast()
functions: acast()
returns its output as a vector, matrix or array while dcast()
returns it as a data frame. They are otherwise identical to cast()
from the reshape package, although they use value.var
instead of value
as the keyword for the value parameter:
library(reshape2)
wide <- dcast(gapminder, country + continent ~ year, value.var = "pop")
head(wide[, 1:12])
## country continent 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997
## 1 Afghanistan Asia 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415
## 2 Albania Europe 1282697 1476505 1728137 1984060 2263554 2509048 2780097 3075321 3326498 3428038
## 3 Algeria Africa 9279525 10270856 11000948 12760499 14760787 17152804 20033753 23254956 26298373 29072015
## 4 Angola Africa 4232095 4561361 4826015 5247469 5894858 6162675 7016384 7874230 8735988 9875024
## 5 Argentina Americas 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463
## 6 Australia Oceania 8691212 9712569 10794968 11872264 13177000 14074100 15184200 16257249 17481977 18565243
Something that’s useful to remember is that the default aggregation function for dcast()
is length
. In other words, it will return a count of the number of elements relevant to the row and column headings by default. In this example, it gives the number of countries for which we have data at each time point and from each continent:
wide <- dcast(gapminder, continent ~ year)
print(wide)
## continent 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## 1 Africa 52 52 52 52 52 52 52 52 52 52 52 52
## 2 Americas 25 25 25 25 25 25 25 25 25 25 25 25
## 3 Asia 33 33 33 33 33 33 33 33 33 33 33 33
## 4 Europe 30 30 30 30 30 30 30 30 30 30 30 30
## 5 Oceania 2 2 2 2 2 2 2 2 2 2 2 2
Unsurprisingly, this is the opposite operation: going from a wide table to a long one.
# Either explicitly define which columns to use
gathercols <- c(
"1952", "1957", "1962", "1967", "1972", "1977",
"1982", "1987", "1992", "1997", "2002", "2007"
)
# Or programmatically extract them from the column names
gathercols <- colnames(wide)[3:length(colnames(wide))]
# Convert from wide format to long format
long <- gather(wide, "year", "pop", all_of(gathercols), factor_key = TRUE)
head(long)
## continent 1952 year pop
## 1 Africa 52 1957 52
## 2 Americas 25 1957 25
## 3 Asia 33 1957 33
## 4 Europe 30 1957 30
## 5 Oceania 2 1957 2
## 6 Africa 52 1962 52
Lastly, we can generate summary statistics for the data in the table at the same time as we are pivoting it. For this example, let’s use the ChickWeight
pre-loaded dataset which details the weights of chicks on different diets at different ages:
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
Here’s how we can pivot the table and end up with the median weight of each chick over the time points:
chick_weights <- aggregate(
weight ~ Chick + Diet, data = ChickWeight, FUN = median
)
chick_weights$Chick <- as.numeric(as.character(chick_weights$Chick))
chick_weights <- chick_weights[order(chick_weights$Chick), ]
head(chick_weights)
## Chick Diet weight
## 15 1 1 99.5
## 17 2 1 112.5
## 14 3 1 107.0
## 11 4 1 94.5
## 18 5 1 123.5
## 12 6 1 132.5