
Descriptive statistics do exactly what they say on the box: they describe the data you are working with. They are also known as summary statistics.

Take a look at the following dataset:

data <- data.frame(
    x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
)

Visualise it by printing it in a table and plotting it:

library(kableExtra)

kable_input <- kable(data)
kable_styling(
    kable_input, full_width = F, position = "float_left",
    bootstrap_options = c("responsive", "hover", "striped", "condensed")
)
par(pin = c(3, 3))
plot(
    data$x, data$y, main = "Scatter plot of the data on the left",
    xlim = c(1, 15), ylim = c(1, 15)
)

 x       y
10    8.04
 8    6.95
13    7.58
 9    8.81
11    8.33
14    9.96
 6    7.24
 4    4.26
12   10.84
 7    4.82
 5    5.68


Summary Statistics

Examples of summary statistics that you might be interested in include the mean, median, minimum, maximum and quartiles. These can all be calculated at once by calling the summary() function:

summary(data)
##        x              y         
##  Min.   : 4.0   Min.   : 4.260  
##  1st Qu.: 6.5   1st Qu.: 6.315  
##  Median : 9.0   Median : 7.580  
##  Mean   : 9.0   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.: 8.570  
##  Max.   :14.0   Max.   :10.840

This is nice to see, but we can't use those values in further calculations while they are in that format. To access each of them individually we need to extract them using indexing:

summary <- summary(data$x)

min <- summary[1]
first_q <- summary[2]
median <- summary[3]
mean <- summary[4]
third_q <- summary[5]
max <- summary[6]

print(paste(min, first_q, median, mean, third_q, max, sep = "; "))
## [1] "4; 6.5; 9; 9; 11.5; 14"

Alternatively, we can calculate them individually instead of using the summary() function (which calculates them all together):

min <- min(data$x)
first_q <- quantile(data$x, 0.25)
median <- median(data$x)
mean <- mean(data$x)
third_q <- quantile(data$x, 0.75)
max <- max(data$x)

print(paste(min, first_q, median, mean, third_q, max, sep = "; "))
## [1] "4; 6.5; 9; 9; 11.5; 14"

Line of Best Fit

As we've already plotted a scatter plot we might as well add a line of best fit, and while we're at it we can calculate its gradient and y-intercept too. The coefficient of determination (\(R^{2}\)) of this linear regression might also be relevant. All of these can be obtained with the linear model function lm():

# Note that you specify the y-data first and the x-data second, separated by a
# tilde (~):
lm(data$y ~ data$x)
## 
## Call:
## lm(formula = data$y ~ data$x)
## 
## Coefficients:
## (Intercept)       data$x  
##      3.0001       0.5001

Now you can extract the coefficients you want using the summary() function and indexing:

# Extract the y-intercept
summary(lm(data$y ~ data$x))$coefficients[1]
## [1] 3.000091
# Extract the gradient
summary(lm(data$y ~ data$x))$coefficients[2]
## [1] 0.5000909
# Extract the r-squared value
summary(lm(data$y ~ data$x))$r.squared
## [1] 0.6665425
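
As a cleaner alternative, the model can be fitted once, stored and then queried, rather than being refitted for each value. A sketch (using a new variable, model, to hold the fitted object); coef() returns the coefficients as a named vector, and this is the approach used again in the Anscombe's quartet section below:

# Fit the model once and store it
model <- lm(data$y ~ data$x)
# Extract the y-intercept and the gradient by name
coef(model)[["(Intercept)"]]
## [1] 3.000091
coef(model)[["data$x"]]
## [1] 0.5000909
# Extract the r-squared value
summary(model)$r.squared
## [1] 0.6665425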

This is what the data looks like with the line of best fit added in:

par(pin = c(3, 3))
plot(
    data$x, data$y,
    main = "The same plot as above, now with a line of best fit",
    xlim = c(1, 15), ylim = c(1, 15)
)
abline(lm(data$y ~ data$x), col = "black")

Anscombe’s quartet

Be careful when looking at descriptive statistics because they can hide a lot of information about a dataset. This is well demonstrated by "Anscombe's quartet": a set of four datasets (of which the above dataset is one) that all have very similar descriptive statistics despite looking vastly different when plotted. Have a look below:

anscombes_quartet <- list(
    I = list(
        c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
        c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
    ),
    II = list(
        c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
        c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)
    ),
    III = list(
        c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
        c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)
    ),
    IV = list(
        c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
        c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89)
    )
)
par(pin = c(6, 6), mfrow = c(2, 2))
for (i in 1:4) {
    dataset <- anscombes_quartet[[i]]
    x <- dataset[[1]]
    y <- dataset[[2]]
    plot(
        x, y, main = sprintf("Dataset %s", i), xlim = c(1, 19),
        ylim = c(1, 19), xlab = "", ylab = ""
    )
    # Add each dataset's own line of best fit
    abline(lm(y ~ x), col = "black")
}

These are quite clearly four very different scatter plots, yet their descriptive statistics are surprisingly similar:

for (i in 1:4) {
    dataset <- anscombes_quartet[[i]]
    x <- dataset[[1]]
    y <- dataset[[2]]
    print(sprintf("Mean of x: %s", mean(x)))
    print(sprintf("Sample variance of x: %s", var(x)))
    print(sprintf("Mean of y: %.2f", mean(y)))
    print(sprintf("Sample variance of y: %.2f", var(y)))
    print(sprintf("Correlation between x and y: %.3f", cor(x, y)))
    linear_model <- lm(y ~ x)
    print(
        sprintf(
            "Linear regression line: y = %.2f + %.3fx",
            coef(linear_model)["(Intercept)"],
            coef(linear_model)["x"]
        )
    )
    print(
        sprintf(
            "Coefficient of determination: %.2f",
            summary(linear_model)$r.squared
        )
    )
    cat("\n")
}

In all four datasets, the following descriptive statistics are (nearly) identical:

Property                                                            Value               Accuracy
Mean of x                                                           9                   exact
Sample variance of x: \(s^{2}\)                                     11                  exact
Mean of y                                                           7.50                to 2 decimal places
Sample variance of y: \(s^{2}\)                                     4.125               ± 0.003
Correlation between x and y                                         0.816               to 3 decimal places
Linear regression line                                              y = 3.00 + 0.500x   to 2 and 3 decimal places, respectively
Coefficient of determination of the linear regression: \(R^{2}\)    0.67                to 2 decimal places
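
If you want to check this in code, one option (a sketch that re-uses the anscombes_quartet list defined above) is to collect the key statistics into a matrix with one column per dataset:

# Calculate the key descriptive statistics for each of the four datasets and
# collect them into a matrix with one column per dataset
sapply(anscombes_quartet, function(d) {
    c(
        mean_x = mean(d[[1]]),
        var_x  = var(d[[1]]),
        mean_y = round(mean(d[[2]]), 2),
        var_y  = round(var(d[[2]]), 3),
        cor_xy = round(cor(d[[1]], d[[2]]), 3)
    )
})

The four columns come out almost indistinguishable from one another, even though the four scatter plots clearly are not.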
