This page is a follow-on from the one about bar plots with single factors.
If you’re going to be using ggplot2, the first thing you need to do is load the library:
library(ggplot2)
Take a look at the dataset below, which contains the results of a sleep experiment. It shows the number of extra hours of sleep, relative to a control group, that each of 10 participants got after taking medicine “1” and after taking medicine “2”:
print(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
## 7 3.7 1 7
## 8 0.8 1 8
## 9 0.0 1 9
## 10 2.0 1 10
## 11 1.9 2 1
## 12 0.8 2 2
## 13 1.1 2 3
## 14 0.1 2 4
## 15 -0.1 2 5
## 16 4.4 2 6
## 17 5.5 2 7
## 18 1.6 2 8
## 19 4.6 2 9
## 20 3.4 2 10
The ‘results’ of the experiment are in the “extra” column, namely the number of extra hours of sleep for each participant for each medicine. The ‘factors’ are in the other two columns: “group” (i.e. which medicine was taken) and “ID” (i.e. the ID of the participant). If we try to plot this as a normal bar plot we do not get the full picture of the experiment:
p <- ggplot(sleep, aes(x = ID, y = extra))
p <- p + geom_bar(stat = "identity")
print(p)
As you can see, there appears to be only one bar for each participant! We’re expecting two bars: one for each of the two medicines they took. Using colour to differentiate the two experimental runs will help to see what’s going wrong:
p <- ggplot(sleep, aes(x = ID, y = extra, fill = factor(group)))
p <- p + geom_bar(stat = "identity")
print(p)
So, what’s going on here is that there are indeed two bars for each participant, but the reddish-pink one (medicine 1) and the blueish-green one (medicine 2) are stacked on top of each other, because geom_bar() uses position = "stack" by default. What we need to do is have them be side-by-side, i.e. for them to ‘dodge’ each other:
p <- ggplot(sleep, aes(x = ID, y = extra, fill = factor(group)))
p <- p + geom_bar(stat = "identity", position = position_dodge())
print(p)
As you can see, this was achieved by using the position_dodge() function and the “position” keyword argument.
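As a side note, here is a minimal sketch of a slightly shorter way to write the same plot. It assumes a reasonably recent version of ggplot2, in which geom_col() is a shortcut for geom_bar(stat = "identity") and the string "dodge" can be passed in place of position_dodge():

```r
library(ggplot2)

# Same dodged bar plot as above, written with geom_col()
p <- ggplot(sleep, aes(x = ID, y = extra, fill = factor(group)))
# geom_col() uses the y values as-is, so stat = "identity" is not needed
p <- p + geom_col(position = "dodge")
print(p)
```

Both spellings should give the same picture, so which one to use is mostly a matter of taste.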
Let’s improve how the plot looks:
p <- ggplot(sleep, aes(x = ID, y = extra, fill = factor(group)))
p <- p + geom_bar(stat = "identity", position = position_dodge())
p <- p + scale_fill_manual(values = c("#F4A582", "#92C5DE"))
p <- p + xlab("Participant ID")
p <- p + labs(
  title = "Student's Sleep Experiment", y = "Additional Sleep Time [hr]",
  fill = "Medicine"
)
print(p)
Notice that each bar currently represents an exact number: the height of each corresponds to a single value, the number of extra hours the participant slept for. The concept of adding error bars to this plot doesn’t work; you can’t calculate a standard error on one number!
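A quick way to convince yourself of this is to ask R for the standard deviation of a single value. The sketch below uses only base R and the built-in sleep dataset; the variable names are just illustrative:

```r
# The standard deviation of one observation is undefined, so R returns NA
single_value <- sleep$extra[1]
print(sd(single_value))

# With all 10 observations for medicine "1" it is well defined
group1 <- sleep$extra[sleep$group == "1"]
print(sd(group1))
```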
If we calculate summary statistics for each group, however, averaging over the participants, that reduces the number of factors by one (ID drops out) and each bar then represents all of the data points in that group. This can be done with the summarySE() function from the “Rmisc” package:
library(Rmisc)
# "measurevar" is the variable being measured
# "groupvars" are the variables representing the groups
sleep_summ <- summarySE(sleep, measurevar = "extra", groupvars = "group")
print(sleep_summ)
## group N extra sd se ci
## 1 1 10 0.75 1.789010 0.5657345 1.279780
## 2 2 10 2.33 2.002249 0.6331666 1.432322
As you can see from the above data frame, we now only have one factor (group) with 10 data points (N) in each. The mean value for each group has been calculated (extra) along with the standard deviation (sd) and the standard error (se). This can now all be plotted:
p <- ggplot(sleep_summ, aes(x = group, y = extra, fill = factor(group)))
p <- p + geom_bar(stat = "identity", position = position_dodge())
p <- p + geom_errorbar(aes(ymax = extra + se, ymin = extra - se), width = 0.2)
# Bar fill colours
p <- p + scale_fill_manual(values = c("#F4A582", "#92C5DE"))
# Titles and labels
p <- p + xlab("Medicine")
p <- p + labs(
  title = "Student's Sleep Experiment", y = "Additional Sleep Time [hr]"
)
# Remove legend
p <- p + theme(legend.position = "none")
print(p)
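If you would rather not depend on Rmisc, the same summary table can be sketched in base R. This assumes the usual definitions: the standard error is sd / sqrt(N), and the ci column is the half-width of a 95% confidence interval based on the t distribution (this appears to reproduce the summarySE() numbers shown earlier, but treat it as an assumption about what that function does). The name sleep_summ_base is just an illustrative choice:

```r
# Per-group sample size, mean and standard deviation using base R only
n_obs <- tapply(sleep$extra, sleep$group, length)
means <- tapply(sleep$extra, sleep$group, mean)
sds <- tapply(sleep$extra, sleep$group, sd)

sleep_summ_base <- data.frame(
  group = factor(names(means)),
  N = as.vector(n_obs),
  extra = as.vector(means),
  sd = as.vector(sds)
)

# Standard error of the mean
sleep_summ_base$se <- sleep_summ_base$sd / sqrt(sleep_summ_base$N)
# Half-width of a 95% confidence interval (assumed to match the "ci" column)
sleep_summ_base$ci <- sleep_summ_base$se *
  qt(0.975, df = sleep_summ_base$N - 1)

print(sleep_summ_base)
```

To draw confidence intervals rather than standard errors, the geom_errorbar() call above could use extra + ci and extra - ci instead.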
The Titanic dataset details the number of passengers that were on board the famous passenger ship that sank in 1912. It contains one ‘result’ (“Freq”, the number of each type of passenger) and four ‘factors’ (“Class”, “Sex”, “Age” and “Survived”). The first 15 rows are as follows:
# Convert to data frame
titanic <- as.data.frame(Titanic)
print(head(titanic, 15))
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Adult No 118
## 10 2nd Male Adult No 154
## 11 3rd Male Adult No 387
## 12 Crew Male Adult No 670
## 13 1st Female Adult No 4
## 14 2nd Female Adult No 13
## 15 3rd Female Adult No 89
There are too many factors to plot all at once: a dodged bar plot like the ones above shows only two (one on the x-axis and one as the fill colour). That’s no problem though, because we can just make four graphs:
p <- ggplot(
  titanic, aes(x = Age, y = Freq, fill = factor(Class))
)
p <- p + geom_bar(stat = "identity", position = position_dodge())
p <- p + scale_fill_manual(
  values = c("#D1E5F0", "#92C5DE", "#4393C3", "#2166AC")
)
p <- p + xlab("Age")
p <- p + labs(
  title = "Age of passengers on the Titanic",
  y = "Count", fill = "Class"
)
print(p)
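One thing to be careful of with the plot above: the data frame has several rows for every combination of Age and Class (one for each value of Sex and Survived). It appears that position_dodge() only separates bars that belong to different fill groups, so rows sharing the same Age and Class are drawn on top of one another and the visible height is that of the largest row rather than the total. Here is a minimal sketch of one way around this, using base R's aggregate() to add up the counts first (the name age_class is just illustrative):

```r
library(ggplot2)

titanic <- as.data.frame(Titanic)

# Sum the counts over Sex and Survived: one row per Age/Class combination
age_class <- aggregate(Freq ~ Age + Class, data = titanic, FUN = sum)
print(age_class)

# Each bar now represents the total count for its Age/Class combination
p <- ggplot(age_class, aes(x = Age, y = Freq, fill = Class))
p <- p + geom_bar(stat = "identity", position = position_dodge())
p <- p + labs(
  title = "Age of passengers on the Titanic (summed counts)",
  y = "Count", fill = "Class"
)
print(p)
```

The same idea applies to any plot of this dataset that uses fewer than all four factors.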
p <- ggplot(
  titanic, aes(x = Class, y = Freq, fill = factor(Sex))
)
p <- p + geom_bar(stat = "identity", position = position_dodge())
p <- p + scale_fill_manual(
  values = c("#92C5DE", "#2166AC")
)
p <- p + xlab("Class")
p <- p + labs(
  title = "Class of passengers on the Titanic",
  y = "Count", fill = "Gender"
)
print(p)
p <- ggplot(
  titanic, aes(x = Sex, y = Freq, fill = factor(Survived))
)
p <- p + geom_bar(stat = "identity", position = position_dodge())
p <- p + scale_fill_manual(
  values = c("#92C5DE", "#2166AC")
)
p <- p + xlab("Gender")
p <- p + labs(
  title = "Gender of passengers on the Titanic",
  y = "Count",
  fill = "Survived"
)
print(p)
p <- ggplot(
  titanic, aes(x = Survived, y = Freq, fill = factor(Age))
)
p <- p + geom_bar(stat = "identity", position = position_dodge())
p <- p + scale_fill_manual(
  values = c("#92C5DE", "#2166AC")
)
p <- p + xlab("Survived")
p <- p + labs(
  title = "Survival of passengers on the Titanic",
  y = "Count", fill = "Age"
)
print(p)
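If you do want all four factors in a single figure, facets are one option. This is only a sketch: it puts Class on the x-axis, uses Age for the fill colour and splits the figure into panels by Sex and Survived using facet_grid(). The choice of which factor goes where is arbitrary:

```r
library(ggplot2)

titanic <- as.data.frame(Titanic)

# Class on the x-axis, Age as the fill colour
p <- ggplot(titanic, aes(x = Class, y = Freq, fill = Age))
p <- p + geom_bar(stat = "identity", position = position_dodge())
# One panel per Sex/Survived combination; label_both adds the factor name
p <- p + facet_grid(Sex ~ Survived, labeller = label_both)
p <- p + scale_fill_manual(values = c("#92C5DE", "#2166AC"))
p <- p + labs(
  title = "Passengers on the Titanic",
  y = "Count", fill = "Age"
)
print(p)
```

Because every row of the data frame is a unique combination of the four factors, each bar here corresponds to exactly one row and nothing is hidden behind anything else.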