
1 How Do I Create a Scatter Plot?

1.1 Using Vectors

The easiest way to create a scatter plot in R is by using the plot() function:

x <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
plot(x, y)

As you can see above, the x-data is a vector of numbers and the y-data is also a vector of numbers.
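The same two vectors can also be generated rather than typed out. This is only a small sketch of an alternative: it relies on the colon operator to build the sequence 0 to 10 and on the fact that arithmetic in R is applied to every element of a vector.

```r
# Build the x-data as the whole numbers from 0 to 10
x <- 0:10
# Squaring a vector squares each of its elements
y <- x^2
plot(x, y)
```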

1.2 Using Data Frames

If you are coding in R you will mostly be working with data frames. Remember that data frames are essentially tables where the columns are vectors, so two columns of a data frame can be plotted against each other very easily! Let’s use one of the pre-loaded data frames that come packaged with R, the Anscombe dataset:

print(anscombe)
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

As you can see there are four columns of x-values (x1, x2, x3, x4) and four columns of y-values (y1, y2, y3, y4). Let’s plot each of them:

plot(anscombe$x1, anscombe$y1)
plot(anscombe$x2, anscombe$y2)
plot(anscombe$x3, anscombe$y3)
plot(anscombe$x4, anscombe$y4)

The Anscombe Quartet is an interesting dataset to use because it consists of four sets of x- and y-values that have near-identical descriptive statistics (means, variances and correlations), despite looking very different when graphed.
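To see how close the descriptive statistics are, they can be calculated for each of the four x/y pairs. The following is a rough sketch: the choice of statistics (mean, variance and correlation) and the names stats, mean_x, mean_y, var_x, var_y and cor_xy are chosen here for illustration.

```r
# Calculate summary statistics for each x/y pair in the anscombe data frame
stats <- sapply(1:4, function(i) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    c(
        mean_x = mean(x), mean_y = mean(y),
        var_x = var(x), var_y = var(y),
        cor_xy = cor(x, y)
    )
})
# One column per dataset, one row per statistic
colnames(stats) <- paste("Dataset", 1:4)
print(round(stats, 2))
```

Each row of the printed table should show almost the same number in all four columns.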

2 Axis Titles

The function plot() can take keyword arguments, some of which are:

  • main which sets the main title of the graph
  • xlab which sets the x-axis label
  • ylab which sets the y-axis label

Here’s what it looks like:

plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: First Dataset",
    xlab = "x-values", ylab = "y-values"
)

2.1 Unicode in Axis Titles

If you are plotting scientific or mathematical results you might need to indicate units or symbols that are outside of the normal Latin alphabet. In these instances you can incorporate Unicode by using the \U Unicode indicator:

plot(
    anscombe$x1, anscombe$y1,
    main = "Demonstration of Unicode in Axis Titles: \U03B6 \U03B5 \U03C9",
    xlab = "Microseconds (\U03BCs)", ylab = "Pi: \U03C0"
)

3 Axis Options

The limits of the x- and y-axes can be changed with the xlim and ylim keyword arguments. Decide what you want the max and min values of the axes to be and specify these as vectors:

plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: First Dataset",
    xlab = "x-values", ylab = "y-values",
    xlim = c(0, 20), ylim = c(0, 13)
)

You can also plot on a log axis if you want. Use the log keyword argument:

plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: First Dataset",
    xlab = "x-values", ylab = "y-values (log scale)",
    log = "y"
)

4 Plot Symbols

The symbol that is used for the plot points can be changed using the pch (plotting character) keyword argument. There are 26 symbols that can be used this way, numbered from 0 to 25. For example, if we want to use squares for the plot instead of circles we can do that by setting pch to zero:

plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: First Dataset",
    pch = 0
)

For a full list of what symbols can be used, see the description of pch on the help page that opens when you run ?points.
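The symbols can also be drawn directly as a quick reference chart. This is a sketch only: the layout (two rows of 13) and the names pch_values, grid_col and grid_row are arbitrary choices made for this example.

```r
# The 26 numbered plot symbols
pch_values <- 0:25
# Arrange them in two rows of 13
grid_col <- pch_values %% 13
grid_row <- pch_values %/% 13
plot(
    grid_col, grid_row,
    pch = pch_values, cex = 2,
    bg = "grey",  # Fill colour, only used by symbols 21 to 25
    xlim = c(-1, 13), ylim = c(-1, 2),
    xlab = "", ylab = "", axes = FALSE,
    main = "Plot Symbols: pch = 0 to 25"
)
# Label each symbol with its pch number
text(grid_col, grid_row, labels = pch_values, pos = 3, offset = 1)
```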

5 Plot Colours

Change the colour of the plot symbols using the col keyword argument:

plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: First Dataset",
    col = "blue", pch = 16
)

There are three different ways of specifying which colour you want to use:

  • Type out the colour’s name, eg col = "blue" will make the symbols blue. There are 657 colours that can be specified by name in this way; run colors() to list them all.
  • Use the RGB value of the colour you want by using the format col = "#RRGGBB". Each pair RR, GG, BB is a hexadecimal number (from 00 to FF) that specifies how much red, green and blue is in the colour of your plot symbol, respectively. For example, "#002147" would make your plot Oxford Blue.
  • If you specify a number then R will use the corresponding colour from the colour palette that is currently loaded; for example, col = 4 will make your symbols blue. This is because the default colour palette is: black, red, green3, blue, cyan, magenta, yellow and grey (in that order), and blue is the 4th element of that list. (Since R 4.0.0 the default palette uses slightly different shades of these eight colours, but in the same order.) If you specify a number larger than 8 it will wrap around and start from the beginning again, so col = 8 will make your symbols grey (the 8th and final colour in the default colour palette) while col = 9 will make your symbols black (the 1st colour). If you are creating a plot that has multiple colours in it then using palettes is a good idea: instead of wasting time trying to find colours that go well together you can just use a palette and cycle through the colours in that. There are many colour palettes available although they need to be ‘loaded’ before you can use them.
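The three ways of specifying a colour can be explored from the console. The sketch below assumes that the default eight-colour palette is loaded; the names current_palette, colour_numbers and wrapped are made up for this example.

```r
# 1. Named colours: colors() returns every name that R recognises
print(length(colors()))

# 2. Hex codes: col2rgb() shows the red, green and blue parts (0 to 255)
print(col2rgb("#002147"))

# 3. Numbers: look up each number in the current palette, wrapping around
current_palette <- palette()
n <- length(current_palette)
colour_numbers <- 1:10
wrapped <- current_palette[(colour_numbers - 1) %% n + 1]
print(wrapped)

# Plot one point per number to see the colours repeat
plot(
    colour_numbers, rep(1, 10),
    col = colour_numbers, pch = 16, cex = 3,
    xlab = "Value of col", ylab = "", yaxt = "n",
    main = "Numeric Colours Wrap Around the Palette"
)
```

With an eight-colour palette, the 9th and 10th points should have the same colours as the 1st and 2nd.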

6 Multiple Plots on the Same Axes

In order to plot multiple sets of data on the same axes, use the points() function for the additional datasets:

# The first dataset is plotted using plot()
plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: All Datasets",
    pch = 0, col = 1,
    xlab = "x-values", ylab = "y-values",
    xlim = c(0, 20), ylim = c(0, 13)
)
# The additional datasets are plotted using points()
points(anscombe$x2, anscombe$y2, pch = 1, col = 2)
points(anscombe$x3, anscombe$y3, pch = 2, col = 3)
points(anscombe$x4, anscombe$y4, pch = 3, col = 4)

This is where the usefulness of having different plot symbols and different plot colours becomes apparent. If all four of these datasets were plotted with the same colour and had the same symbol, it would be impossible to tell them apart!

6.1 Legends

With all these datasets being plotted, we’d better use a legend to identify them:

# The first dataset is plotted using plot()
plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: All Datasets",
    pch = 0, col = 1,
    xlab = "x-values", ylab = "y-values",
    xlim = c(0, 20), ylim = c(0, 13)
)
# The additional datasets are plotted using points()
points(anscombe$x2, anscombe$y2, pch = 1, col = 2)
points(anscombe$x3, anscombe$y3, pch = 2, col = 3)
points(anscombe$x4, anscombe$y4, pch = 3, col = 4)
# Include a legend
legend(
    15, 5,
    c("Dataset 1", "Dataset 2", "Dataset 3", "Dataset 4"),
    pch = c(0, 1, 2, 3),
    col = c(1, 2, 3, 4)
)
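In the call above, the first two arguments (15, 5) are the x- and y-coordinates of the top-left corner of the legend box, given in the units of the axes. As an alternative, legend() also accepts a position keyword in place of the coordinates. The sketch below uses "bottomright" and only two of the datasets to keep it short; the name legend_info is chosen just for this example.

```r
plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: Datasets 1 and 2",
    pch = 0, col = 1,
    xlab = "x-values", ylab = "y-values",
    xlim = c(0, 20), ylim = c(0, 13)
)
points(anscombe$x2, anscombe$y2, pch = 1, col = 2)
# A keyword such as "bottomright" replaces the x- and y-coordinates
legend_info <- legend(
    "bottomright",
    legend = c("Dataset 1", "Dataset 2"),
    pch = c(0, 1),
    col = c(1, 2)
)
```

Using a keyword means the legend stays in the corner even if the axis limits are changed later.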

7 Add a Regression Line

To add a ‘line of best fit’, use the lm() (linear model) function then the abline() function (which draws a straight line with intercept a and slope b, and takes both values directly from the fitted model when it is given the output of lm()). Note that the lm() function uses the tilde notation (lm(y ~ x)) and so the y-variable must go on the left of the tilde and the x-variable on the right. This is the opposite of the plot(x, y) format which requires that the x-variable comes first.

# The first dataset is plotted using plot()
plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: All Datasets",
    pch = 0, col = 1,
    xlab = "x-values", ylab = "y-values",
    xlim = c(0, 20), ylim = c(0, 13)
)
# The additional datasets are plotted using points()
points(anscombe$x2, anscombe$y2, pch = 1, col = 2)
points(anscombe$x3, anscombe$y3, pch = 2, col = 3)
points(anscombe$x4, anscombe$y4, pch = 3, col = 4)
# Plot lines of best fit
abline(lm(anscombe$y1 ~ anscombe$x1), col = 1)
abline(lm(anscombe$y2 ~ anscombe$x2), col = 2)
abline(lm(anscombe$y3 ~ anscombe$x3), col = 3)
abline(lm(anscombe$y4 ~ anscombe$x4), col = 4)
# Include a legend
legend(
    15, 5,
    c("Dataset 1", "Dataset 2", "Dataset 3", "Dataset 4"),
    pch = c(0, 1, 2, 3),
    col = c(1, 2, 3, 4)
)

Notice that all four lines of best fit are almost exactly the same! We can only see one straight line because the other three are hidden underneath it. This is what is so interesting about the Anscombe Quartet of datasets that we used: despite the fact that they look so different when plotted, the statistics that describe them are almost identical!
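To check how close the four fitted lines really are, the intercept and slope of each one can be extracted with coef(). This is a minimal sketch; the names fits, intercept and slope are chosen here for illustration.

```r
# Fit a linear model to each x/y pair and collect the coefficients
fits <- sapply(1:4, function(i) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    coef(lm(y ~ x))
})
# One column per dataset, one row per coefficient
rownames(fits) <- c("intercept", "slope")
colnames(fits) <- paste("Dataset", 1:4)
print(round(fits, 3))
```

The intercepts should all be close to 3 and the slopes close to 0.5, which is why the four lines overlap on the plot.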

8 More Options

There are many more options that can be tweaked in order to customise your plot and have it look exactly how you want it. Here are some of them:

  • Using the par() function to set general graphical parameters:
    • par(pin = c(width, height)) will set the dimensions of the plot in inches (so divide a length in centimetres by 2.54 to convert it to inches, eg par(pin = c(20 / 2.54, 15 / 2.54)) will create a graph 20 cm in width and 15 cm in height, provided that the graphics device is large enough to hold it)
    • par(plt = c(xstart, xend, ystart, yend)) will set the dimensions of the graph inside of the plot area (ie it will change the amount of white space around the plot). It does this using fractions of the figure region, so par(plt = c(0.2, 0.98, 0.2, 0.85)) will have the plot area start 20% of the way in from the left-hand side of the image and run until 98% of the way to the right-hand side of the image and, similarly, it will start 20% of the way up from the bottom and end 85% of the way up to the top of the image.
  • Using additional keyword arguments inside plot():
    • plot(x, y, frame = FALSE) will remove the border around the graph
    • plot(x, y, asp = 2) will set the aspect ratio of the plot (y/x), in this case so that one unit on the y-axis is drawn twice as long as one unit on the x-axis
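Because par() changes settings for every plot that follows, it can be useful to save the old settings and restore them afterwards. The sketch below combines the plt example from the list above with frame = FALSE; the name old_par is chosen just for this example.

```r
# par() returns the previous values of the settings it changes
old_par <- par(plt = c(0.2, 0.98, 0.2, 0.85))
plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: First Dataset",
    xlab = "x-values", ylab = "y-values",
    frame = FALSE
)
# Restore the previous settings so that later plots are unaffected
par(old_par)
```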

9 Save Plot

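Finally, save your plot to your computer as an image: call png("Name of Plot.png") before your plotting commands to open an image file, then call dev.off() after them to close the file and write it to disk.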
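A minimal sketch of the full sequence is shown here: png() opens an image file, the plotting commands draw into it, and dev.off() closes it. It assumes that your installation of R supports PNG output. The file name, the use of a temporary folder and the image size in pixels are hypothetical choices for this example.

```r
# Choose where to save the image (here, R's temporary folder)
plot_file <- file.path(tempdir(), "anscombe_first_dataset.png")
# Open the image file; width and height are in pixels by default
png(plot_file, width = 800, height = 600)
plot(
    anscombe$x1, anscombe$y1,
    main = "Anscombe's Quartet: First Dataset",
    xlab = "x-values", ylab = "y-values"
)
# Close the file; the image is not complete until this is called
dev.off()
# Confirm that the image now exists
print(file.exists(plot_file))
```

To save the image next to your script instead, replace plot_file with a file name such as "Name of Plot.png".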
