Bee swarm plots enable you to see all the individual points in a dataset as well as the distribution, making them a very powerful visualisation tool.
We’ll use the built-in iris
dataset (more info here and here). This contains various measurements from samples of different species of iris flower, although for this example we’ll just use one measurement (petal length) of two of the species (versicolor and virginica):
df <- iris[
(iris["Species"] == "versicolor") |
(iris["Species"] == "virginica"),
]
df <- df[c("Petal.Length", "Species")]
# Drop unused levels
df <- droplevels(df)
print(head(df))
## Petal.Length Species
## 51 4.7 versicolor
## 52 4.5 versicolor
## 53 4.9 versicolor
## 54 4.0 versicolor
## 55 4.6 versicolor
## 56 4.5 versicolor
The beeswarm
library contains the beeswarm()
function which can be used as follows:
# Load the beeswarm library
library(beeswarm)
# Create a bee swarm plot
beeswarm(
# Plot the petal lengths as a function of the species of iris flower
df[["Petal.Length"]] ~ df[["Species"]],
# Set the main title of the plot
main = "Length of Petals of Different Species of Iris Flowers",
# Set the x- and y-axis labels
xlab = "Species", ylab = "Petal Length [cm]",
# Set the labels of the groups
labels = c("Versicolor", "Virginica")
)
To better visualise the distribution, plot a box plot first and add the bee swarm plot in over it:
boxplot(
# Plot the petal lengths as a function of the species of iris flower
df[["Petal.Length"]] ~ df[["Species"]],
# Set the main title of the plot
main = "Length of Petals of Different Species of Iris Flowers",
# Set the x- and y-axis labels
xlab = "Species", ylab = "Petal Length [cm]",
# Set the names of the groups
names = c("Versicolor", "Virginica")
)
beeswarm(
# Plot the petal lengths as a function of the species of iris flower
df[["Petal.Length"]] ~ df[["Species"]],
# Add this plot to the box plot that has just been created
add=TRUE
)
We can see from the plots that the petals of the virginica flowers are, on average, longer than those of the versicolor flowers, but is this difference statistically significant? We can do a t-test to find out:
# Significance test
stats <- t.test(df[["Petal.Length"]] ~ df[["Species"]])
The t.test()
function returns a number of things and we have assigned all of them to the stats
variable. One of the things that is returned is the best estimate (ie the mean) of the length of each species’s petals:
print(stats$estimate)
## mean in group versicolor mean in group virginica
## 4.260 5.552
The confidence interval of the difference between these means is also returned:
diff = stats$estimate[2] - stats$estimate[1]
ci_upper = abs(stats$conf.int[1])
ci_lower = abs(stats$conf.int[2])
sprintf("A difference of %.3f cm (95%% CI: %.3f - %.3f cm) between the mean petal lengths of versicolor and virginica flowers was found", diff, ci_lower, ci_upper)
## [1] "A difference of 1.292 cm (95% CI: 1.089 - 1.495 cm) between the mean petal lengths of versicolor and virginica flowers was found"
…and, of course, the p-value of the t-test is returned:
print(stats$p.value)
## [1] 4.900288e-22
This can be re-formatted as follows:
significance_strings_from_p <- function(p) {
if (p < 0.001) {
stars <- "***"
str <- " (***)"
} else if (p < 0.01) {
stars <- "**"
str <- " (**)"
} else if (p < 0.05) {
stars <- "*"
str <- " (*)"
} else if (p < 0.1) {
stars <- "."
str <- " (.)"
} else {
stars <- ""
str <- ""
}
sig <- list("stars" = stars, "str" = str)
return(sig)
}
sig <- significance_strings_from_p(stats$p.value)
sprintf("The petal length of versicolor flowers is not the same as that of virginica flowers; p = %8.2e%s", stats$p.value, sig$str)
## [1] "The petal length of versicolor flowers is not the same as that of virginica flowers; p = 4.90e-22 (***)"
These results can be added to the plot itself, allowing as much information to be communicated as efficiently as possible:
In order to have enough space for these, the height of the plot needs to be increased by 15%. Do this by adjusting the y-axis limits:
# Increase height of plot area by 15%
range <- max(df[["Petal.Length"]]) - min(df[["Petal.Length"]])
ylim <- c(min(df[["Petal.Length"]]), max(df[["Petal.Length"]]) + 0.15 * range)
# Box plot
boxplot(
df[["Petal.Length"]] ~ df[["Species"]],
main = "Length of Petals of Different Species of Iris Flowers",
xlab = "Species", ylab = "Petal Length",
ylim = ylim,
names = c("Versicolor", "Virginica")
)
# Bee swarm plot
beeswarm(
df[["Petal.Length"]] ~ df[["Species"]],
add=TRUE
)
# Text labels
text(
(1 + 2) / 2, max(df[["Petal.Length"]]) + 0.16 * range,
labels = sig$stars, cex = 1.5
)
text(
(1 + 2) / 2, max(df[["Petal.Length"]]) + 0.10 * range,
labels = sprintf("p = %8.2e", stats$p.value), cex = 0.9
)
# Horizontal line
y1 <- max(df[["Petal.Length"]]) + 0.14 * range
lines(c(1, 2), c(y1, y1))
# Vertical lines
y2 <- max(df[["Petal.Length"]]) + 0.04 * range
lines(c(1, 1), c(y1, y2))
lines(c(2, 2), c(y1, y2))
The height of the plot needs to be increased by 15% again, but this time it needs to be added in at the bottom as well as at the top:
# Increase height of plot area by 15% at the top and 15% at the bottom
range <- max(df[["Petal.Length"]]) - min(df[["Petal.Length"]])
ylim <- c(min(df[["Petal.Length"]]) - 0.15 * range, max(df[["Petal.Length"]]) + 0.15 * range)
# Box plot
bp <- boxplot(
df[["Petal.Length"]] ~ df[["Species"]],
main = "Length of Petals of Different Species of Iris Flowers",
xlab = "Species", ylab = "Petal Length",
ylim = ylim,
names = c("Versicolor", "Virginica")
)
# Bee swarm plot
beeswarm(
df[["Petal.Length"]] ~ df[["Species"]],
add=TRUE
)
# Text labels
text(
(1 + 2) / 2, max(df[["Petal.Length"]]) + 0.17 * range,
labels = sig$stars, cex = 1.5
)
text(
(1 + 2) / 2, max(df[["Petal.Length"]]) + 0.09 * range,
labels = sprintf("p = %8.2e", stats$p.value), cex = 0.9
)
# Horizontal line
y1 <- max(df[["Petal.Length"]]) + 0.14 * range
lines(c(1, 2), c(y1, y1))
# Vertical lines
y2 <- max(df[["Petal.Length"]]) + 0.04 * range
lines(c(1, 1), c(y1, y2))
lines(c(2, 2), c(y1, y2))
# Confidence intervals
# +/-1.58 IQR/sqrt(n)
for (col in seq_along(numeric(ncol(bp$conf)))) {
text(
col, min(df[["Petal.Length"]]) - 0.13 * range, cex = 0.9,
labels = paste(
"CI:",
round(bp$conf[1, col], 2), "-",
round(bp$conf[2, col], 2)
)
)
}
Finally, export your plot to your computer as an image by having png("Name of Plot.png")
at the top of your code (before you start the plotting). Various options are available to control the way this image looks:
# Export image to PNG
png(file = "Box Plot and Bee Swarm.png", width = 700, height = 700, pointsize = 24)
To save the plot as a PDF, use the following:
# Export image to PDF
cairo_pdf(file = "Box Plot and Bee Swarm.pdf", width = 8, height = 8, pointsize = 24)