For this page we will use data from the finals of various Olympic events, available on Wikipedia.
This page functions as a follow-on from Boxplots with One Group of Data.
If we try to plot multiple datasets (e.g. the results from more than one Olympic Games) with two groups within each dataset (men and women), things start to become confusing:
library(ggplot2)
long_jump <- data.frame(
    olympics = c(
        rep("Rio 2016 (Men)", 8), rep("Tokyo 2020 (Men)", 8),
        rep("Rio 2016 (Women)", 8), rep("Tokyo 2020 (Women)", 8)
    ),
    distance = c(
        8.38, 8.37, 8.29, 8.25, 8.17, 8.10, 8.06, 8.05,
        8.41, 8.41, 8.21, 8.18, 8.15, 8.10, 8.08, 7.99,
        7.17, 7.15, 7.08, 6.95, 6.81, 6.79, 6.74, 6.69,
        7.00, 6.97, 6.97, 6.91, 6.88, 6.84, 6.83, 6.80
    )
)
p <- ggplot(long_jump, aes(olympics, distance))
p <- p + geom_boxplot()
print(p)
In the above figure we just have four distinct box plots without the groups having been separated out.
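If your data arrives with combined labels like the ones above, the two grouping variables can be recovered before plotting. The following is a minimal sketch using base R's sub(); the labels vector and the regular expressions are illustrative and assume every label has the form "Games (Group)".

```r
# Illustrative labels in the same "Games (Group)" form used above
labels <- c(
    "Rio 2016 (Men)", "Tokyo 2020 (Men)",
    "Rio 2016 (Women)", "Tokyo 2020 (Women)"
)

# Drop the bracketed part to get the Olympics
olympics <- sub(" \\(.*\\)$", "", labels)
# Keep only the text inside the brackets to get the group
men_women <- sub("^.*\\((.*)\\)$", "\\1", labels)

print(data.frame(olympics, men_women))
```

With the two variables in separate columns, one can be mapped to the x-axis and the other to colour.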
It’s a good idea to use colour, a legend and the ordering of the box plots to show the different data sets. This will help to make things as clear as possible:
long_jump <- data.frame(
    olympics = c(
        rep("Rio 2016", 8), rep("Tokyo 2020", 8),
        rep("Rio 2016", 8), rep("Tokyo 2020", 8)
    ),
    men_women = c(rep("Men", 16), rep("Women", 16)),
    distance = c(
        8.38, 8.37, 8.29, 8.25, 8.17, 8.10, 8.06, 8.05,
        8.41, 8.41, 8.21, 8.18, 8.15, 8.10, 8.08, 7.99,
        7.17, 7.15, 7.08, 6.95, 6.81, 6.79, 6.74, 6.69,
        7.00, 6.97, 6.97, 6.91, 6.88, 6.84, 6.83, 6.80
    )
)
p <- ggplot(long_jump, aes(olympics, distance))
p <- p + geom_boxplot(aes(fill = men_women))
print(p)
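The order of the boxes and of the legend entries follows the order of the values in each column; for character columns ggplot2 sorts them alphabetically. One way to set the order explicitly is to convert the column to a factor with the levels listed in the order you want. The sketch below assumes, purely for illustration, that women should be shown before men, and uses a cut-down version of the data.

```r
library(ggplot2)

# Cut-down data: the top three distances from each final above
long_jump <- data.frame(
    olympics = rep(rep(c("Rio 2016", "Tokyo 2020"), each = 3), times = 2),
    men_women = rep(c("Men", "Women"), each = 6),
    distance = c(
        8.38, 8.37, 8.29, 8.41, 8.41, 8.21,
        7.17, 7.15, 7.08, 7.00, 6.97, 6.97
    )
)

# The order of the levels controls the order of the boxes and legend entries
long_jump$men_women <- factor(long_jump$men_women, levels = c("Women", "Men"))

p <- ggplot(long_jump, aes(olympics, distance))
p <- p + geom_boxplot(aes(fill = men_women))
print(p)
```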
An improvement to the above graph would be to swap the grouping: plot by gender first and then by Olympics rather than Olympics first and gender second:
long_jump <- data.frame(
    olympics = c(
        rep("Beijing 2008", 8), rep("London 2012", 8), rep("Rio 2016", 8), rep("Tokyo 2020", 8),
        rep("Beijing 2008", 8), rep("London 2012", 8), rep("Rio 2016", 8), rep("Tokyo 2020", 8)
    ),
    men_women = c(rep("Men", 32), rep("Women", 32)),
    distance = c(
        8.34, 8.24, 8.20, 8.19, 8.19, 8.16, 8.07, 8.00,
        8.31, 8.16, 8.12, 8.11, 8.10, 8.07, 8.01, 7.93,
        8.38, 8.37, 8.29, 8.25, 8.17, 8.10, 8.06, 8.05,
        8.41, 8.41, 8.21, 8.18, 8.15, 8.10, 8.08, 7.99,
        7.04, 7.03, 6.91, 6.79, 6.76, 6.70, 6.64, 6.58,
        7.12, 7.07, 6.89, 6.88, 6.77, 6.76, 6.72, 6.67,
        7.17, 7.15, 7.08, 6.95, 6.81, 6.79, 6.74, 6.69,
        7.00, 6.97, 6.97, 6.91, 6.88, 6.84, 6.83, 6.80
    )
)
p <- ggplot(long_jump, aes(men_women, distance))
p <- p + geom_boxplot(aes(fill = olympics))
print(p)
Improve the graph further by adding a title and changing the labels and colours:
- ggtitle() controls the graph title
- ylab() sets the y-axis label
- xlab() sets the x-axis label
- labs() can be used to change the title of the legend
- scale_fill_manual() manually changes the fill colours, both in the boxes themselves and in the corresponding icons in the legend
p <- ggplot(long_jump, aes(men_women, distance))
p <- p + geom_boxplot(aes(fill = olympics))
p <- p + ggtitle("Long Jump Finals at the Last Four Olympic Games")
p <- p + ylab("Distance [m]")
p <- p + xlab("")
p <- p + labs(fill = "Olympics")
p <- p + scale_fill_manual(values = c("#0074c5", "#d71921", "#e400a3", "#00a650"))
print(p)
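When scale_fill_manual() is given an unnamed vector, the colours are matched to the groups in order, so the mapping can shift if a group is added or renamed. As a hedged alternative, the values can be supplied as a named vector so that each colour is tied to a specific Olympics. The name olympics_colours and the cut-down data below are illustrative only.

```r
library(ggplot2)

# Each colour is tied to one Olympics by name
olympics_colours <- c(
    "Beijing 2008" = "#0074c5", "London 2012" = "#d71921",
    "Rio 2016" = "#e400a3", "Tokyo 2020" = "#00a650"
)

# Cut-down data: the top two distances from each men's final above
long_jump <- data.frame(
    olympics = rep(names(olympics_colours), each = 2),
    distance = c(8.34, 8.24, 8.31, 8.16, 8.38, 8.37, 8.41, 8.41)
)

p <- ggplot(long_jump, aes(olympics, distance))
p <- p + geom_boxplot(aes(fill = olympics))
p <- p + scale_fill_manual(values = olympics_colours)
print(p)
```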
It can be very useful to show summary statistics directly on the plot rather than hiding them away in a paragraph of text. This requires custom functions, each returning a data frame with a y position and a label, which then get called from within the stat_summary() function:
# Custom functions
sample_size <- function(x) {
    # Sample size labels
    return(
        data.frame(y = median(x) * 1.06, label = paste0("n=", length(x)))
    )
}
sample_mean <- function(x) {
    # Sample mean labels
    return(
        data.frame(
            y = median(x) * 1.05,
            label = paste0("\U00B5=", round(mean(x), 2), "m")
        )
    )
}
p <- ggplot(long_jump, aes(men_women, distance, fill = olympics))
p <- p + geom_boxplot()
p <- p + ggtitle("Long Jump Finals at the Last Four Olympic Games")
p <- p + ylab("Distance [m]")
p <- p + xlab("")
p <- p + labs(fill = "Olympics")
p <- p + scale_fill_manual(values = c("#0074c5", "#d71921", "#e400a3", "#00a650"))
# fun.data supplies both the y position and the text of each label,
# so no separate fun argument is needed
p <- p + stat_summary(
    fun.data = sample_size, geom = "text",
    position = position_dodge(width = 0.75), size = 3
)
p <- p + stat_summary(
    fun.data = sample_mean, geom = "text",
    position = position_dodge(width = 0.75), size = 3
)
print(p)
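Because the labels are computed inside stat_summary(), it can be worth checking them against a plain table. The sketch below is a sanity check rather than part of the plotting code: it uses base R's tapply() on the men's Rio 2016 and Tokyo 2020 distances from above, and the data frame name men is an illustrative choice.

```r
# Men's long jump final distances from the data above
men <- data.frame(
    olympics = rep(c("Rio 2016", "Tokyo 2020"), each = 8),
    distance = c(
        8.38, 8.37, 8.29, 8.25, 8.17, 8.10, 8.06, 8.05,
        8.41, 8.41, 8.21, 8.18, 8.15, 8.10, 8.08, 7.99
    )
)

# Sample size and rounded mean per Olympics, computed the same way as the labels
counts <- tapply(men$distance, men$olympics, length)
means <- tapply(men$distance, men$olympics, function(x) round(mean(x), 2))

print(counts)
print(means)
```

The values printed here should agree with the n= and µ= labels drawn on the corresponding boxes.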
Instead of representing the multiple groups via multiple box plots we could instead have one box plot and overlay the individual points in a jitter plot (a type of scatter plot where the points have been moved off-centre by a random amount - ‘jittered’ - in order to improve their visibility). Adding colour-coding will then ensure that the sub-groups are still distinguishable:
p <- ggplot(long_jump, aes(x = men_women, y = distance))
p <- p + geom_boxplot()
p <- p + geom_jitter(aes(color = factor(olympics)), width = 0.2)
p <- p + ggtitle("Long Jump Finals at the Last Four Olympic Games")
p <- p + ylab("Distance [m]")
p <- p + xlab("")
p <- p + labs(color = "Olympics")
p <- p + scale_color_manual(values = c("#0074c5", "#d71921", "#e400a3", "#00a650"))
print(p)
There are a few differences between this code and the previous examples:
- In the aes() function there is no fill argument being used. This stops the data from being split into one box plot per Olympics.
- The geom_jitter() function is included. This spreads the data points in the scatter plot out to make them more visible, and the color argument causes them to be coloured by Olympics.
- In the labs() function, the label for the color is being set, not the fill as has been done previously. This is because we are using the colour of the points to distinguish the groups, not the fill colour of the boxes.
- The scale_color_manual() function is needed to control the colour of the points, as opposed to scale_fill_manual().
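Two details of jittering are worth knowing. The offsets are random, so the plot changes slightly every time it is drawn, and unless a height is given geom_jitter() also moves the points a small amount vertically. A hedged variation is to pass a position_jitter() object with a seed and with height = 0, which makes the plot reproducible and leaves the distances untouched. The seed value and the cut-down data are arbitrary choices for illustration.

```r
library(ggplot2)

# Cut-down data: the top three distances from the Rio 2016 and Tokyo 2020 finals above
long_jump <- data.frame(
    olympics = rep(rep(c("Rio 2016", "Tokyo 2020"), each = 3), times = 2),
    men_women = rep(c("Men", "Women"), each = 6),
    distance = c(
        8.38, 8.37, 8.29, 8.41, 8.41, 8.21,
        7.17, 7.15, 7.08, 7.00, 6.97, 6.97
    )
)

p <- ggplot(long_jump, aes(x = men_women, y = distance))
p <- p + geom_boxplot()
# Jitter horizontally only, using a fixed seed so the plot is the same on every run
p <- p + geom_jitter(
    aes(color = olympics),
    position = position_jitter(width = 0.2, height = 0, seed = 1)
)
print(p)
```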
This is another way of doing the above, but using the geom_dotplot() function instead of the geom_jitter() function. The stackdir = "center" option ensures that points with similar values remain stacked horizontally instead of hiding behind one another:
p <- ggplot(long_jump, aes(x = men_women, y = distance, color = olympics))
p <- p + geom_boxplot(
    outlier.colour = "black", outlier.shape = 16, outlier.size = 0.5,
    notch = FALSE, aes(fill = olympics)
)
p <- p + geom_dotplot(
    binaxis = "y", stackdir = "center", dotsize = 0.4,
    position = position_dodge(0.75), binwidth = 0.05,
    show.legend = FALSE
)
p <- p + ggtitle("Long Jump Finals at the Last Four Olympic Games")
# guide = "none" hides the colour legend (guide = FALSE is deprecated in recent ggplot2)
p <- p + scale_color_manual(values = c("#B2182B", "#2166AC", "#B2182B", "#2166AC"), guide = "none")
p <- p + scale_fill_manual(values = c("#F4A582", "#92C5DE", "#F4A582", "#92C5DE"))
p <- p + theme_light()
p <- p + scale_x_discrete(limits = c("Men", "Women"))
p <- p + ylab("Distance [m]")
p <- p + xlab("")
p <- p + labs(fill = "Olympics")
print(p)
Finally, save your plot to your computer as an image using one of the following (depending on what format you want the image to be in):
png("File Name.png")
pdf("File Name.pdf")
ggsave("File Name.png")
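If you use one of the first two, it must come before you start plotting the graph (i.e. before you call ggplot()), and you must call dev.off() after the plot has been printed so that the file is written and closed. If you use the last one (ggsave()) it must come after you’ve plotted the graph; by default it saves the most recently displayed plot.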
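The two approaches can be sketched as follows. The file names, image dimensions and the tiny example plot are placeholders, and the files are written to a temporary folder so that nothing in your working directory is overwritten.

```r
library(ggplot2)

# Placeholder plot: a few distances from the Tokyo 2020 finals above
example_data <- data.frame(
    men_women = rep(c("Men", "Women"), each = 3),
    distance = c(8.41, 8.41, 8.21, 7.00, 6.97, 6.97)
)
p <- ggplot(example_data, aes(men_women, distance))
p <- p + geom_boxplot()

# Option 1: open a graphics device, draw the plot, then close the device
png_file <- file.path(tempdir(), "long_jump_device.png")
png(png_file, width = 800, height = 600)
print(p)
invisible(dev.off())

# Option 2: call ggsave() after the plot has been created
ggsave_file <- file.path(tempdir(), "long_jump_ggsave.png")
ggsave(ggsave_file, plot = p, width = 8, height = 6, dpi = 300)
```

Passing the plot object to ggsave() explicitly avoids relying on whichever plot happened to be displayed last.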