This page is a more advanced version of the intro to descriptive statistics.
Find the basics statistics that describe a set of data points.
If your data is stored in a vector, you can simply run functions like median()
, IQR()
and so on. Because R is primarily made for statistics, these type of operations are named in a very logical and easy-to-remember way!
data <- c(1, 2, 3, 4, 5, 6, 7, 8, 10, 11)
med <- median(data)
print(med)
avg <- mean(data)
print(avg)
iqr <- IQR(data)
print(iqr)
## [1] 5.5
## [1] 5.7
## [1] 4.5
The columns of a data frame are actually vectors, so the same operations as above are also very straightforward:
df <- data.frame(
group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
data = c(1, 2, 3, 4, 5, 6, 7, 8, 10, 11)
)
med <- median(df$data)
print(med)
avg <- mean(df$data)
print(avg)
iqr <- IQR(df$data)
print(iqr)
## [1] 5.5
## [1] 5.7
## [1] 4.5
We can calculate all these summary statistics in one go with the help of the describeBy()
function from the psych
package. With this function we can just throw in a data frame column and it will output a table with all the information we want, although doing so will generate a warning:
library(psych)
# Without specifying a group
# (gives a warning)
db <- psych::describeBy(df$data)
## Warning in psych::describeBy(df$data): no grouping variable requested
print(db)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 5.7 3.33 5.5 5.62 3.71 1 11 10 0.17 -1.46 1.05
A single value from this table can be accessed by indexing it:
print(db$median)
## [1] 5.5
If we want to avoid the warning that the above call to describeBy()
created, we need to specify a group. In our example, we have a data frame where once column is called “group” and it tells us that all the data points are in group “1”:
# Specifying a group
db <- psych::describeBy(df$data, group = df$group)
print(db)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 5.7 3.33 5.5 5.62 3.71 1 11 10 0.17 -1.46 1.05
You can see that no warning was generated this time, although in order to access a single value we now need to index the table twice:
print(db[[1]]$median)
## [1] 5.5
For this example we are going to use the “Gapminder” dataset which can be installed with the gapminder
package:
library(gapminder)
This dataset contains demographic information about multiple countries across different time points:
# Convert the tibble to a data frame
gapminder <- as.data.frame(gapminder)
# Preview our data
print(head(gapminder, 15))
print(tail(gapminder))
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## 11 Afghanistan Asia 2002 42.129 25268405 726.7341
## 12 Afghanistan Asia 2007 43.828 31889923 974.5803
## 13 Albania Europe 1952 55.230 1282697 1601.0561
## 14 Albania Europe 1957 59.280 1476505 1942.2842
## 15 Albania Europe 1962 64.820 1728137 2312.8890
## country continent year lifeExp pop gdpPercap
## 1699 Zimbabwe Africa 1982 60.363 7636524 788.8550
## 1700 Zimbabwe Africa 1987 62.351 9216418 706.1573
## 1701 Zimbabwe Africa 1992 60.377 10704340 693.4208
## 1702 Zimbabwe Africa 1997 46.809 11404948 792.4500
## 1703 Zimbabwe Africa 2002 39.989 11926563 672.0386
## 1704 Zimbabwe Africa 2007 43.487 12311143 469.7093
This dataset is a bit larger than what we need for this tutorial, so let’s trim it down a bit - to 27 rows - just to make it easier to use:
df <- subset(
gapminder,
(continent == "Oceania" | continent == "Americas") & year == 2007
)
print(df)
## country continent year lifeExp pop gdpPercap
## 60 Argentina Americas 2007 75.320 40301927 12779.380
## 72 Australia Oceania 2007 81.235 20434176 34435.367
## 144 Bolivia Americas 2007 65.554 9119152 3822.137
## 180 Brazil Americas 2007 72.390 190010647 9065.801
## 252 Canada Americas 2007 80.653 33390141 36319.235
## 288 Chile Americas 2007 78.553 16284741 13171.639
## 312 Colombia Americas 2007 72.889 44227550 7006.580
## 360 Costa Rica Americas 2007 78.782 4133884 9645.061
## 396 Cuba Americas 2007 78.273 11416987 8948.103
## 444 Dominican Republic Americas 2007 72.235 9319622 6025.375
## 456 Ecuador Americas 2007 74.994 13755680 6873.262
## 480 El Salvador Americas 2007 71.878 6939688 5728.354
## 612 Guatemala Americas 2007 70.259 12572928 5186.050
## 648 Haiti Americas 2007 60.916 8502814 1201.637
## 660 Honduras Americas 2007 70.198 7483763 3548.331
## 792 Jamaica Americas 2007 72.567 2780132 7320.880
## 996 Mexico Americas 2007 76.195 108700891 11977.575
## 1104 New Zealand Oceania 2007 80.204 4115771 25185.009
## 1116 Nicaragua Americas 2007 72.899 5675356 2749.321
## 1188 Panama Americas 2007 75.537 3242173 9809.186
## 1200 Paraguay Americas 2007 71.752 6667147 4172.838
## 1212 Peru Americas 2007 71.421 28674757 7408.906
## 1260 Puerto Rico Americas 2007 78.746 3942491 19328.709
## 1560 Trinidad and Tobago Americas 2007 69.819 1056608 18008.509
## 1620 United States Americas 2007 78.242 301139947 42951.653
## 1632 Uruguay Americas 2007 76.384 3447496 10611.463
## 1644 Venezuela Americas 2007 73.747 26084662 11415.806
A ‘discriminator’ is like a category. In this example we are going to categorise the 27 countries in our dataset as either having a “Large” or a “Small” population:
df$discriminator <- ifelse(df$pop >= 10000000, "Large", "Small")
print(df)
## country continent year lifeExp pop gdpPercap discriminator
## 60 Argentina Americas 2007 75.320 40301927 12779.380 Large
## 72 Australia Oceania 2007 81.235 20434176 34435.367 Large
## 144 Bolivia Americas 2007 65.554 9119152 3822.137 Small
## 180 Brazil Americas 2007 72.390 190010647 9065.801 Large
## 252 Canada Americas 2007 80.653 33390141 36319.235 Large
## 288 Chile Americas 2007 78.553 16284741 13171.639 Large
## 312 Colombia Americas 2007 72.889 44227550 7006.580 Large
## 360 Costa Rica Americas 2007 78.782 4133884 9645.061 Small
## 396 Cuba Americas 2007 78.273 11416987 8948.103 Large
## 444 Dominican Republic Americas 2007 72.235 9319622 6025.375 Small
## 456 Ecuador Americas 2007 74.994 13755680 6873.262 Large
## 480 El Salvador Americas 2007 71.878 6939688 5728.354 Small
## 612 Guatemala Americas 2007 70.259 12572928 5186.050 Large
## 648 Haiti Americas 2007 60.916 8502814 1201.637 Small
## 660 Honduras Americas 2007 70.198 7483763 3548.331 Small
## 792 Jamaica Americas 2007 72.567 2780132 7320.880 Small
## 996 Mexico Americas 2007 76.195 108700891 11977.575 Large
## 1104 New Zealand Oceania 2007 80.204 4115771 25185.009 Small
## 1116 Nicaragua Americas 2007 72.899 5675356 2749.321 Small
## 1188 Panama Americas 2007 75.537 3242173 9809.186 Small
## 1200 Paraguay Americas 2007 71.752 6667147 4172.838 Small
## 1212 Peru Americas 2007 71.421 28674757 7408.906 Large
## 1260 Puerto Rico Americas 2007 78.746 3942491 19328.709 Small
## 1560 Trinidad and Tobago Americas 2007 69.819 1056608 18008.509 Small
## 1620 United States Americas 2007 78.242 301139947 42951.653 Large
## 1632 Uruguay Americas 2007 76.384 3447496 10611.463 Small
## 1644 Venezuela Americas 2007 73.747 26084662 11415.806 Large
Use the group_by()
function from the dplyr
package to sort this data into two groups by the discriminator, ie sort the countries into “Large” and “Small”:
library(dplyr)
table <- dplyr::group_by(df, discriminator)
The summarise()
function together with the mean()
function will produce a table that shows the means (and only the means) of each metric (life expectancy, population and GDP per capita) for each group (Large and Small countries):
table_mean <- summarise(
table,
lifeExp = mean(lifeExp, na.rm = TRUE),
pop = mean(pop, na.rm = TRUE),
gdpPercap = mean(gdpPercap, na.rm = TRUE)
)
# Convert the tibble to a data frame
table_mean <- as.data.frame(table_mean)
print(table_mean)
## discriminator lifeExp pop gdpPercap
## 1 Large 75.70546 65153464 15964.566
## 2 Small 72.67650 5459007 9082.629
This table is in ‘wide’ format, but we can convert it into ‘long’ format using the gather()
function from the tidyr
package:
library(tidyr)
# Convert from wide to long
table_mean <- tidyr::gather(
table_mean, key = "variable", "value", lifeExp, pop, gdpPercap
)
print(table_mean)
## discriminator variable value
## 1 Large lifeExp 7.570546e+01
## 2 Small lifeExp 7.267650e+01
## 3 Large pop 6.515346e+07
## 4 Small pop 5.459007e+06
## 5 Large gdpPercap 1.596457e+04
## 6 Small gdpPercap 9.082629e+03
A similar method that leaves us with the same result but transposed (so the discriminator forms the column headings):
table_mean <- summarise(
table,
lifeExp = mean(lifeExp, na.rm = TRUE),
pop = mean(pop, na.rm = TRUE),
gdpPercap = mean(gdpPercap, na.rm = TRUE)
)
table_mean <- as.data.frame(t(table_mean))
colnames(table_mean) <- table_mean[1, ]
table_mean <- table_mean[-1, ]
print(table_mean)
## Large Small
## lifeExp 75.70546 72.67650
## pop 65153464 5459007
## gdpPercap 15964.566 9082.629
We can change what the table looks like with a bit of string formatting (using sprintf()
):
table_mean <- summarise(
table,
`Life Exp` = sprintf("%4.1f (%3.1f)", mean(lifeExp), sd(lifeExp)),
`Population` = sprintf("%4.1f (%3.1f)", mean(pop) / 10^6, sd(pop) / 10^6),
`GDP per Capita` = sprintf("%4.1f (%3.1f)", mean(gdpPercap), sd(gdpPercap))
)
table_mean <- as.data.frame(t(table_mean))
colnames(table_mean) <- table_mean[1, ]
table_mean <- table_mean[-1, ]
print(table_mean)
## Large Small
## Life Exp 75.7 (3.5) 72.7 (5.2)
## Population 65.2 (87.0) 5.5 (2.6)
## GDP per Capita 15964.6 (12869.3) 9082.6 (7100.7)
This time we are creating a table with multiple statistics (instead of just the mean), but just for the ‘population’ metric:
table_pop <- summarise(
table,
median = median(pop, na.rm = T),
n = n(),
non_na_count = sum(!is.na(pop)),
stdev = sd(pop, na.rm = T)
)
table_pop <- as.data.frame(table_pop)
print(table_pop)
## discriminator median n non_na_count stdev
## 1 Large 28674757 13 13 86976436
## 2 Small 4904620 14 14 2583115
A specific value can be extracted from the table through indexing:
print(table_pop[1, "median"])
## [1] 28674757
print(table_pop[table_pop$discriminator == "Small", "median"])
## [1] 4904620
Here it is transposed and formatted:
table_pop <- summarise(
table,
Median = sprintf("%3.1f million", median(pop) / 10^6),
n = n(),
`Non-nulls` = sum(!is.na(pop)),
`Standard deviation` = sprintf("%3.1f million", sd(pop) / 10^6)
)
table_pop <- as.data.frame(t(table_pop))
colnames(table_pop) <- table_pop[1, ]
table_pop <- table_pop[-1, ]
print(table_pop)
## Large Small
## Median 28.7 million 4.9 million
## n 13 14
## Non-nulls 13 14
## Standard deviation 87.0 million 2.6 million