When you download and open R for the first time you have access to all of the functions in the base library. These include all the ones you use most often. A complete list can be found on this page.
Here are some of the functions included in base R:
# Replace all instances of a substring in a string
gsub("2", "to", "2 be or not 2 be")
# Replace the first instance of a substring in a string
sub("quest", "question", "that is the quest")
# Create a file path by joining the words with "/", which R accepts as the
# separator on macOS, Linux and Windows
file.path("Desktop", "New Folder", "Memes")
# List the files in a folder
list.files(".")
# Create a vector
c(12, 18, 15, 13)
# Take the mean (average) of the given numbers
mean(c(12, 18, 15, 13))
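To make the difference between sub() and gsub() concrete, here is a small sketch that runs both on the same string and stores the results. The variable names (first_only, every_match, numbers, average) are made up for this illustration.

```r
# sub() replaces only the first match; gsub() replaces every match
first_only  <- sub("2", "to", "2 be or not 2 be")
every_match <- gsub("2", "to", "2 be or not 2 be")
print(first_only)   # "to be or not 2 be"
print(every_match)  # "to be or not to be"

# Functions can be combined: the vector built by c() is passed to mean()
numbers <- c(12, 18, 15, 13)  # hypothetical name for the example vector
average <- mean(numbers)
print(average)      # 14.5
```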
You can do a lot with the functions in base R, but often you need to do something that’s more complicated. For these occasions, you’ll need to use functions from a library or package.
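Let’s say you want to do some statistics using a Normal distribution and plot a graph. The functions for working with Normal data live in the stats package, which ships with R and is attached by default, and while base R does have the plot() function, it can be quite limited. We can do a lot more if we load the stats and ggplot2 packages (ggplot2 has to be installed first, for example with install.packages("ggplot2")):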
library(stats)
library(ggplot2)
These give us access to hundreds of new functions, including dnorm() (which produces the density of a Normal distribution at given points) and ggplot() (which produces nice graphs):
x <- seq(-4, 4, by = 0.05)
df <- data.frame(
  x = x,
  y = stats::dnorm(x, mean = 0, sd = 1)
)
ggplot2::ggplot(df) + ggplot2::geom_point(ggplot2::aes(x, y))
Notice that in the code that generated this graph I explicitly showed which package each function came from using the double colon notation ::. I used stats::dnorm() to show that dnorm() comes from the stats package, and likewise ggplot2::ggplot() and ggplot2::geom_point() from ggplot2. It’s actually not necessary to do this once a package has been loaded with library(), because R searches the attached packages for each function name, but sometimes it’s useful to remind the person reading your code which packages you are using (especially when using more obscure ones).
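As a rough illustration of the double colon notation, the sketch below uses only packages that ship with R. It assumes a standard R installation; the file name "report.csv" and the variable names are made up for the example.

```r
# stats is attached by default, so both spellings reach the same function
same_function <- identical(dnorm, stats::dnorm)
print(same_function)  # TRUE

# The density of a standard Normal distribution at 0 is 1 / sqrt(2 * pi)
peak <- stats::dnorm(0, mean = 0, sd = 1)
print(all.equal(peak, 1 / sqrt(2 * pi)))  # TRUE

# The prefix also works for a package that has NOT been attached with
# library(): tools ships with R but is not on the search path by default
extension <- tools::file_ext("report.csv")  # hypothetical file name
print(extension)  # "csv"
```

Writing the prefix is also a way to say which function you mean when two attached packages export the same name.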
Take a look at the following code which records the amount of time it took four people to run 5 km and converts these into speed in km/hr:
library(chron)
distance <- 5 # km
runner_1 <- "00:18:47"
runner_2 <- "00:19:03"
runner_3 <- "00:19:20"
runner_4 <- "00:19:54"
speed_runner_1 <- as.numeric(distance / times(runner_1) / 24)
speed_runner_2 <- as.numeric(distance / times(runner_2) / 24)
speed_runner_3 <- as.numeric(distance / times(runner_3) / 24)
speed_runner_4 <- as.numeric(distance / times(runner_4) / 24)
sprintf(
  "Runner 1 ran %s km in %s at a speed of %5.2f km/hr",
  distance, runner_1, speed_runner_1
)
sprintf(
  "Runner 2 ran %s km in %s at a speed of %5.2f km/hr",
  distance, runner_2, speed_runner_2
)
sprintf(
  "Runner 3 ran %s km in %s at a speed of %5.2f km/hr",
  distance, runner_3, speed_runner_3
)
sprintf(
  "Runner 4 ran %s km in %s at a speed of %5.2f km/hr",
  distance, runner_4, speed_runner_4
)
## [1] "Runner 1 ran 5 km in 00:18:47 at a speed of 15.97 km/hr"
## [1] "Runner 2 ran 5 km in 00:19:03 at a speed of 15.75 km/hr"
## [1] "Runner 3 ran 5 km in 00:19:20 at a speed of 15.52 km/hr"
## [1] "Runner 4 ran 5 km in 00:19:54 at a speed of 15.08 km/hr"
Although the code works fine, there is an awful lot of repetition! There is a principle in programming called DRY: don’t repeat yourself. In general, if you find yourself copy-pasting code from within your own script or performing the same calculation more than twice, you should consider moving it to its own function:
time_to_speed <- function(i, time, distance) {
  speed <- as.numeric(distance / times(time) / 24)
  msg <- sprintf(
    "Runner %s ran %s km in %s at a speed of %5.2f km/hr",
    i, distance, time, speed
  )
  print(msg)
}
distance <- 5 # km
# Named run_times to avoid reusing the name of chron's times() function
run_times <- c("00:18:47", "00:19:03", "00:19:20", "00:19:54")
for (i in seq_along(run_times)) {
  time_to_speed(i, run_times[i], distance)
}
## [1] "Runner 1 ran 5 km in 00:18:47 at a speed of 15.97 km/hr"
## [1] "Runner 2 ran 5 km in 00:19:03 at a speed of 15.75 km/hr"
## [1] "Runner 3 ran 5 km in 00:19:20 at a speed of 15.52 km/hr"
## [1] "Runner 4 ran 5 km in 00:19:54 at a speed of 15.08 km/hr"
You have the same output with no repetition. This is the power of a function. Note how it has been created:
- It has a name, time_to_speed, which hints at the fact that you are converting time into speed
- The function keyword is used to define it
- It takes three arguments:
  - i: which number runner it is
  - time: that runner’s time as a string in hr:min:sec format
  - distance: the length of the race in km
You could clean the code up even further by moving the function into a separate file and importing it as if it were a library.
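One common way to keep a function in a separate file is base R's source(), which runs a script and so defines its functions in your session. The sketch below is only an illustration: the file name speed_utils.R and the helper km_per_hour() are hypothetical, and the file is written to a temporary folder so that the example is self-contained. It also takes the time in seconds to avoid depending on the chron package.

```r
# Write a small helper script to a temporary folder. In practice you would
# save the file yourself, for example as "speed_utils.R" (hypothetical name)
helper_file <- file.path(tempdir(), "speed_utils.R")
writeLines(c(
  "# Convert a distance in km and a time in seconds into a speed in km/hr",
  "km_per_hour <- function(distance_km, seconds) {",
  "  distance_km / (seconds / 3600)",
  "}"
), helper_file)

# source() runs the file, which defines km_per_hour() in the current session
source(helper_file)

# Runner 1 took 18 min 47 s, which is 1127 seconds
speed <- km_per_hour(5, 18 * 60 + 47)
print(round(speed, 2))  # 15.97
```

The main script then only needs the source() line and the calls, while the function definition lives in one place that other scripts can reuse.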
If you give a column of a data frame to a function, how will the function know whether you want it to act on the column as a whole (i.e. treat the column as one thing) or on every individual item in the column in turn (i.e. treat the column as multiple things)? What if you have a multi-step function with many if-else statements determining what happens? One way to deal with complicated situations like this is to define a custom function and then apply it to an input. Here’s an example that uses a subset of the built-in iris dataset:
# Create the data to be used in this example
df <- iris[c(1, 2, 3, 51, 52, 53, 101, 102, 103), ]
print(df)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
Now, here’s a function that finds flowers of the ‘virginica’ species:
find_virginica <- function(val) {
  if (val == "virginica") {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
This function can be applied to the ‘Species’ column of our data frame by using the sapply() function. The ‘s’ in ‘sapply’ is short for “simplify”: it calls the function on each element in turn and then simplifies the list of results into a vector where possible:
virginica_flowers <- sapply(df[["Species"]], find_virginica)
print(virginica_flowers)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
It has correctly identified that only the last three rows refer to virginica flowers. Now, here’s sapply() being applied to every column of the data frame in turn:
for (colname in colnames(df)) {
  df[[colname]] <- sapply(df[[colname]], find_virginica)
}
print(df)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          FALSE       FALSE        FALSE       FALSE   FALSE
## 2          FALSE       FALSE        FALSE       FALSE   FALSE
## 3          FALSE       FALSE        FALSE       FALSE   FALSE
## 51         FALSE       FALSE        FALSE       FALSE   FALSE
## 52         FALSE       FALSE        FALSE       FALSE   FALSE
## 53         FALSE       FALSE        FALSE       FALSE   FALSE
## 101        FALSE       FALSE        FALSE       FALSE    TRUE
## 102        FALSE       FALSE        FALSE       FALSE    TRUE
## 103        FALSE       FALSE        FALSE       FALSE    TRUE
The find_virginica() function has found all the instances of virginica flowers in the table by checking each value in each column in turn.
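To tie this back to the question of acting on a column as a whole versus item by item, here is a minimal sketch comparing the two approaches. It rebuilds the nine-row subset under the made-up name flowers so that it does not depend on the earlier code, and it rewrites find_virginica() as a single comparison. This is an assumption for the example, since an if statement expects one value at a time.

```r
# Rebuild the nine-row subset so this example stands on its own
flowers <- iris[c(1, 2, 3, 51, 52, 53, 101, 102, 103), ]

# Same idea as the helper in the text, written as a single comparison
find_virginica <- function(val) {
  val == "virginica"
}

# Item by item: sapply() calls the function once per value in the column
one_by_one <- sapply(flowers[["Species"]], find_virginica)

# Column as a whole: == is vectorised, so one call compares every value
whole_column <- flowers[["Species"]] == "virginica"

# vapply() works like sapply() but checks that every result is one logical
checked <- vapply(flowers[["Species"]], find_virginica, logical(1))

print(all(one_by_one == whole_column))  # TRUE
print(sum(whole_column))                # 3
```

For a simple test like this the vectorised comparison is the shorter option; sapply() and vapply() are more useful when the function contains logic, such as if-else branches, that only makes sense for one value at a time.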