For this example, we will use the built-in Anscombe data set (specifically, we will use the first of Anscombe’s quartet):
# Our x-data
anscombe$x1
# Our y-data
anscombe$y1
x1 | 10.00 | 8.00 | 13.00 | 9.00 | 11.00 | 14.00 | 6.00 | 4.00 | 12.00 | 7.00 | 5.00 |
y1 | 8.04 | 6.95 | 7.58 | 8.81 | 8.33 | 9.96 | 7.24 | 4.26 | 10.84 | 4.82 | 5.68 |
In general, any two columns of a data frame that have numerical data can work.
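A quick way to confirm which columns qualify is to test each one with is.numeric(). The sketch below is only an illustration; numeric_cols is a name introduced here and is not part of the original example.

```r
# Check which columns of the built-in anscombe data frame are numeric
numeric_cols <- sapply(anscombe, is.numeric)
print(numeric_cols)

# Keep only the names of the numeric columns
names(anscombe)[numeric_cols]
```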
To create a graph with ggplot, you first need to create the plot area (with the ggplot() function) and the aesthetic mapping (with the aes() function):
library(ggplot2)
p <- ggplot(anscombe, aes(x = x1, y = y1))
print(p)
Then add your data. In this case, they will be added as points:
p <- p + geom_point()
print(p)
Titles can be added to the plot by using the following functions:
- ggtitle() sets the main title of the graph
- xlab() sets the x-axis label
- ylab() sets the y-axis label
Here’s what it looks like:
p <- p + ggtitle("Anscombe's Quartet: First Data Set")
p <- p + ylab("y-values")
p <- p + xlab("x-values")
print(p)
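As an alternative sketch, ggplot2's labs() function can set the title and both axis labels in a single call. The object name p_labs is introduced here purely for illustration.

```r
library(ggplot2)

# Same plot as above, with all three labels set in one labs() call
p_labs <- ggplot(anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  labs(
    title = "Anscombe's Quartet: First Data Set",
    x = "x-values",
    y = "y-values"
  )
print(p_labs)
```

Whether to use labs() or the three separate functions is a matter of taste; the resulting plot should look the same.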
If you are plotting scientific or mathematical results, you might need to use units or symbols that are outside of the basic Latin alphabet. In these instances you can incorporate Unicode characters by using the \U escape followed by the character's hexadecimal code point:
q <- p + ggtitle("Demonstration of Unicode in Axis Titles: \U03B6 \U03B5 \U03C9")
q <- q + ylab("Pi: \U03C0")
q <- q + xlab("Microseconds (\U03BCs)")
print(q)
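To see what these escapes produce without drawing a plot, the strings can be inspected directly in base R. The sketch below is illustrative only (pi_char, micro_s and alpha_b are names made up here). It also shows the braced form \U{...}, which is worth using when the character that follows the escape is itself a valid hexadecimal digit.

```r
# "\U03C0" is the single character pi, whose code point is hexadecimal 03C0
pi_char <- "\U03C0"
utf8ToInt(pi_char) == 0x03C0

# The escape stops at the first non-hexadecimal character, so the "s" in
# "\U03BCs" is kept as an ordinary letter after the Greek letter mu
micro_s <- "\U03BCs"
nchar(micro_s)

# "b" is a hexadecimal digit, so braces are needed to end the escape
# before it: this is alpha followed by the letter b
alpha_b <- "\U{03B1}b"
nchar(alpha_b)
```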
The limits of the x- and y-axes can be changed with the xlim() and ylim() functions. Decide what you want the minimum and maximum values of each axis to be and specify these as arguments:
p <- p + xlim(0, 20)
p <- p + ylim(0, 13)
print(p)
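One caveat worth knowing: xlim() and ylim() set scale limits, so any points outside the chosen range are dropped (with a warning when the plot is drawn) rather than merely hidden. The sketch below contrasts this with coord_cartesian(), which zooms the view without discarding data. The upper limit of 8 and the names base, p_lim and p_zoom are chosen here only to make the difference visible.

```r
library(ggplot2)

base <- ggplot(anscombe, aes(x = x1, y = y1)) + geom_point()

# Scale limits: y-values above 8 are replaced by NA and not drawn
p_lim <- base + ylim(0, 8)

# Coordinate limits: the view is zoomed but every point is kept
p_zoom <- base + coord_cartesian(ylim = c(0, 8))

# Count the y-values that each approach has turned into NA
sum(is.na(layer_data(p_lim)$y))
sum(is.na(layer_data(p_zoom)$y))
```

With the limits used in the main example (0 to 20 and 0 to 13) no points fall outside the range, so the two approaches give the same picture there.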
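You can also plot on a logarithmic axis. Use the scale_y_continuous() function with a log10 transformation (because p already has a y scale from ylim(), ggplot will report that the existing scale is being replaced):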
q <- p + scale_y_continuous(trans = "log10")
print(q)
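A shorter spelling of the same idea is scale_y_log10(), which ggplot2 provides as a convenience function. The sketch below starts from a fresh plot (called q_log here, a name introduced for illustration) rather than from p, because a lower limit of 0 cannot be shown on a log axis.

```r
library(ggplot2)

# Fresh plot without ylim(), since log10(0) is -Inf
q_log <- ggplot(anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  scale_y_log10()
print(q_log)

# The built layer stores the y-values on the log10 scale
head(layer_data(q_log)$y)
```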
The symbol that is used for the plot points can be changed using the shape keyword argument in the geom_point() function. There are 26 symbols that can be used this way. For example, if we want to use hollow circles for the plot we can do that by setting shape to 1:
p <- ggplot(anscombe, aes(x = x1, y = y1))
p <- p + geom_point(shape = 1)
p <- p + ggtitle("Anscombe's Quartet: First Data Set")
p <- p + ylab("y-values")
p <- p + xlab("x-values")
p <- p + xlim(0, 20)
p <- p + ylim(0, 13)
print(p)
For a full list of the symbols that can be used, see the ggplot2 aesthetic specifications vignette by running vignette("ggplot2-specs").
Change the look of the plot symbols using keyword arguments in the geom_point() function:
- colour sets either the colour of the symbol (if it only has one colour) or its border colour
- fill sets the fill colour of the symbol if it has a fill colour
- size sets the size of the symbol, not including the border if there is one
- stroke sets the size of the border if there is one
p <- p + geom_point(shape = 21, colour = "skyblue", fill = "white", size = 1.5, stroke = 1.5)
print(p)
There are three different ways of specifying which colour you want to use:
- colour = "blue" will make the symbols blue. There are 657 colours that can be specified by name in this way; run colours() to list them all.
- colour = "#RRGGBB". Each pair RR, GG, BB is a hexadecimal number (from 00 to FF) that specifies how much red, green and blue, respectively, is in the colour of your plot symbol. For example, "#002147" would make your symbols Oxford Blue.
- colour = 4 will make your symbols blue. This is because the default colour palette is: black, red, green3, blue, cyan, magenta, yellow and grey (in that order), and blue is the 4th element of that list. (R 4.0.0 and later use slightly different shades of the same eight hues.) If you specify a number larger than 8 it will wrap around and start from the beginning again, so colour = 8 will make your symbols grey (the 8th and final colour in the default colour palette) while colour = 9 will make your symbols black (the 1st colour).
If you are creating a plot that has multiple colours in it then using palettes is a good idea: instead of wasting time trying to find colours that go well together you can just use a palette and cycle through its colours. There are many colour palettes available, although they need to be ‘loaded’ before you can use them.
The easiest way to plot multiple data sets on the same axes is to convert your data into long format before plotting. Here is one way to do this using the first two data sets within the anscombe data frame:
# Extract the first data set
group1 <- anscombe[c("x1", "y1")]
# Add the name of this group
group1$group <- "First data set"
# Standardise the column names
colnames(group1) <- c("x", "y", "group")
# Extract the second data set
group2 <- anscombe[c("x2", "y2")]
# Add the name of this group
group2$group <- "Second data set"
# Standardise the column names
colnames(group2) <- c("x", "y", "group")
# Combine the data
data <- rbind(group1, group2)
print(data)
## x y group
## 1 10 8.04 First data set
## 2 8 6.95 First data set
## 3 13 7.58 First data set
## 4 9 8.81 First data set
## 5 11 8.33 First data set
## 6 14 9.96 First data set
## 7 6 7.24 First data set
## 8 4 4.26 First data set
## 9 12 10.84 First data set
## 10 7 4.82 First data set
## 11 5 5.68 First data set
## 12 10 9.14 Second data set
## 13 8 8.14 Second data set
## 14 13 8.74 Second data set
## 15 9 8.77 Second data set
## 16 11 9.26 Second data set
## 17 14 8.10 Second data set
## 18 6 6.13 Second data set
## 19 4 3.10 Second data set
## 20 12 9.13 Second data set
## 21 7 7.26 Second data set
## 22 5 4.74 Second data set
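The same steps can be wrapped in a small helper so that all four data sets are converted at once. This is only a sketch of one possible approach: to_long and data_all are names introduced here, and the group labels ("Data set 1" and so on) are an arbitrary choice.

```r
# Build the long-format rows for data set number i of the quartet
to_long <- function(i) {
  data.frame(
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]],
    group = paste("Data set", i)
  )
}

# Convert all four data sets and stack them into one data frame
data_all <- do.call(rbind, lapply(1:4, to_long))
str(data_all)
```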
Now you can plot it. Because the data are in long format and the group column is mapped to the colour aesthetic, ggplot will use a different colour for each data set and add a legend automatically:
p <- ggplot(data, aes(x, y, group = group, col = group))
p <- p + geom_point()
p <- p + ggtitle("Anscombe's Quartet: First and Second Data Sets")
p <- p + ylab("y-values")
p <- p + xlab("x-values")
p <- p + xlim(0, 20)
p <- p + ylim(0, 13)
print(p)
This is where the usefulness of having different plot symbols and different plot colours becomes apparent. If both of these data sets were plotted with the same colour and had the same symbol, it would be impossible to tell them apart!
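If you want to control both the symbol and the colour of each group yourself, one option is to map group to the shape aesthetic as well and then pick the values by hand with scale_colour_manual() and scale_shape_manual(). The sketch below assumes the same two-group long format as above; the particular colours and shapes, and the name p_manual, are arbitrary choices made for illustration.

```r
library(ggplot2)

# Two groups in long format, equivalent to the data frame built above
data <- rbind(
  data.frame(x = anscombe$x1, y = anscombe$y1, group = "First data set"),
  data.frame(x = anscombe$x2, y = anscombe$y2, group = "Second data set")
)

# Map group to both colour and shape, then choose the values by hand
p_manual <- ggplot(data, aes(x, y, colour = group, shape = group)) +
  geom_point() +
  scale_colour_manual(values = c("#002147", "orange")) +
  scale_shape_manual(values = c(1, 17))
print(p_manual)
```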
To add a ‘line of best fit’, use the geom_smooth() function with the lm (linear model) method. Setting se = FALSE leaves out the shaded standard error band around each line:
p <- ggplot(data, aes(x, y, group = group, col = group))
p <- p + geom_point()
p <- p + geom_smooth(method = "lm", se = FALSE)
p <- p + ggtitle("Anscombe's Quartet: First and Second Data Sets")
p <- p + ylab("y-values")
p <- p + xlab("x-values")
p <- p + xlim(0, 20)
p <- p + ylim(0, 11.5)
print(p)
Notice that you can only see a blue line. This is because both data sets have almost exactly the same line of best fit, so the orange one is hidden underneath the blue one! This is why Anscombe's quartet is interesting: despite the fact that the data sets look so different when plotted, the statistics that describe them are almost identical!
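This claim can be checked numerically by fitting the two linear models directly with lm(). The sketch below uses base R only; fit1 and fit2 are names introduced here. Both fits should come out with an intercept close to 3 and a slope close to 0.5.

```r
# Fit a separate linear model to each of the first two data sets
fit1 <- lm(y1 ~ x1, data = anscombe)
fit2 <- lm(y2 ~ x2, data = anscombe)

# Intercept and slope of each fit, rounded to two decimal places
round(coef(fit1), 2)
round(coef(fit2), 2)

# The means of the y-values also agree
round(c(mean(anscombe$y1), mean(anscombe$y2)), 2)
```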
Add an annotation that is a straight line with its label outside of the plot area. It can be placed at whatever position you like, but here we will place it at the height of the average value of the data (which, for both of the data sets, is y = 7.50). This can be done using ‘grobs’ (graphical objects), in this case text and line objects from the ‘grid’ package:
library(grid)
# Add annotation text in the y-axis labels area
gtext <- textGrob("Mean", x = -0.04, gp = gpar(col = "red", fontsize = 8))
p <- p + annotation_custom(gtext, xmin = -Inf, xmax = Inf, ymin = 7.50, ymax = 7.50)
# Add annotation line
gline <- linesGrob(x = c(-0.005, 1), gp = gpar(col = "red", lwd = 2))
p <- p + annotation_custom(gline, xmin = -Inf, xmax = Inf, ymin = 7.50, ymax = 7.50)
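# Convert the plot to a grob so that its layout can be edited
g <- ggplotGrob(p)
# Turn clipping off for the panel so the label outside it is drawn
g$layout$clip[g$layout$name == "panel"] <- "off"
# Start a new page, then draw the result
grid.newpage()
grid.draw(g)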
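A possible alternative, sketched below, is to switch clipping off through the coordinate system with coord_cartesian(clip = "off"), so that the plot can be printed in the usual way. This is a different technique from the grob editing above, not a restatement of it. It assumes the same two-group long-format data, sets the axis ranges through coord_cartesian() instead of xlim() and ylim(), and computes the mean from the data rather than typing 7.50. The names y_mean and p_clip are introduced here for illustration.

```r
library(ggplot2)
library(grid)

# Two groups in long format, equivalent to the data frame built earlier
data <- rbind(
  data.frame(x = anscombe$x1, y = anscombe$y1, group = "First data set"),
  data.frame(x = anscombe$x2, y = anscombe$y2, group = "Second data set")
)
y_mean <- mean(data$y)

p_clip <- ggplot(data, aes(x, y, colour = group)) +
  geom_point() +
  # clip = "off" lets annotations extend beyond the panel
  coord_cartesian(xlim = c(0, 20), ylim = c(0, 11.5), clip = "off") +
  # Label placed just to the left of the panel, at the height of the mean
  annotation_custom(
    textGrob("Mean", x = -0.04, gp = gpar(col = "red", fontsize = 8)),
    ymin = y_mean, ymax = y_mean
  ) +
  # Horizontal line across the panel at the same height
  annotation_custom(
    linesGrob(x = c(-0.005, 1), gp = gpar(col = "red", lwd = 2)),
    ymin = y_mean, ymax = y_mean
  )
print(p_clip)
```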
Save your plot to your computer as an image using ggsave(). Set the dimensions and quality of the image using keyword arguments:
ggsave(
  filename = "Name of Plot.png", plot = last_plot(),
  width = 148, height = 105, # A6 paper size
  units = "mm",
  dpi = 150
)
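The physical dimensions and the dpi together determine the size of the image in pixels. The sketch below saves a simple plot to a temporary file (so nothing is written to your working directory) and works out the expected pixel size by converting millimetres to inches and multiplying by the dpi. It passes the plot object explicitly instead of relying on last_plot(); p_save and out_file are names introduced here.

```r
library(ggplot2)

p_save <- ggplot(anscombe, aes(x = x1, y = y1)) + geom_point()

# Write to a temporary file so the working directory is left untouched
out_file <- tempfile(fileext = ".png")

ggsave(
  filename = out_file, plot = p_save,
  width = 148, height = 105, # A6 paper size
  units = "mm",
  dpi = 150
)

# Expected pixel dimensions: millimetres -> inches -> pixels
round(c(width = 148, height = 105) / 25.4 * 150)

# Confirm that the file was written
file.exists(out_file)
```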