1 Built-In vs Package vs Custom Functions

1.1 Built-In Functions

When you download and open R for the first time you have access to all of the functions in the base library. These include all the ones you use most often. A complete list can be found on this page.

Here are some of the functions included in base R:

# Replace all instances of a substring in a string
gsub("2", "to", "2 be or not 2 be")
# Replace the first instance of a substring in a string
sub("quest", "question", "that is the quest")
# Create a file path by adding in "/" on macOS and Linux and "\" on Windows
# between the words
file.path("Desktop", "New Folder", "Memes")
# List the files in a folder
list.files(".")
# Create a vector
c(12, 18, 15, 13)
# Take the mean (average) of the given numbers
mean(c(12, 18, 15, 13))

You can do a lot with the functions in base R, but often you need to do something that’s more complicated. For these occasions, you’ll need to use functions from a library or package.

1.2 Packages’ Functions

Let’s say you want to do some statistics using a Normal distribution and plot a graph. Base R doesn’t include anything to generate Normal data and, while it does have the plot() function, it can be quite limited. We can do a lot more if we import the stats and ggplot packages:

library(stats)
library(ggplot2)

These give us access to hundreds of new functions, including dnorm() (which produces the density of a Normal distribution at given points) and ggplot() (which produces nice graphs):

x <- seq(-4, 4, by = 0.05)
df <- data.frame(
    x = x,
    y = stats::dnorm(x, mean = 0, sd = 1)
)
ggplot2::ggplot(df) + ggplot2::geom_point(aes(x, y))

Notice that in the code that generated this graph I explicitly showed which package each function came from using the double colon notation ::. I used stats::dnorm() to show that dnorm() comes from the stats library and likewise with ggplot2::ggplot() and ggplot2::geom_point() from ggplot2. It’s actually not necessary to do this - R can guess which library each function comes from - but sometimes it’s useful to remind the person reading your code what libraries you are using (especially when using more obscure packages).

1.3 Custom Functions

Take a look at the following code which records the amount of time it took four people to run 5 km and converts these into speed in km/hr:

library(chron)

distance <- 5  # km
runner_1 <- "00:18:47"
runner_2 <- "00:19:03"
runner_3 <- "00:19:20"
runner_4 <- "00:19:54"

speed_runner_1 <- as.numeric(distance / times(runner_1) / 24)
speed_runner_2 <- as.numeric(distance / times(runner_2) / 24)
speed_runner_3 <- as.numeric(distance / times(runner_3) / 24)
speed_runner_4 <- as.numeric(distance / times(runner_4) / 24)

sprintf(
    "Runner 1 ran %s km in %s at a speed of %5.2f km/hr",
    distance, runner_1, speed_runner_1
)
sprintf(
    "Runner 2 ran %s km in %s at a speed of %5.2f km/hr",
    distance, runner_2, speed_runner_2
)
sprintf(
    "Runner 3 ran %s km in %s at a speed of %5.2f km/hr",
    distance, runner_3, speed_runner_3
)
sprintf(
    "Runner 4 ran %s km in %s at a speed of %5.2f km/hr",
    distance, runner_4, speed_runner_4
)

## [1] "Runner 1 ran 5 km in 00:18:47 at a speed of 15.97 km/hr"
## [1] "Runner 2 ran 5 km in 00:19:03 at a speed of 15.75 km/hr"
## [1] "Runner 3 ran 5 km in 00:19:20 at a speed of 15.52 km/hr"
## [1] "Runner 4 ran 5 km in 00:19:54 at a speed of 15.08 km/hr"

Although the code works fine there is an awful lot of repetition! There is a principle in programming called DRY: don’t repeat yourself. In general, if you find yourself copy-pasting code from within your own script or performing the same calculation more that twice you should consider moving it to it’s own function:

time_to_speed <- function(i, time, distance) {
    speed <- as.numeric(distance / times(time) / 24)
    msg <- sprintf(
        "Runner %s ran %s km in %s at a speed of %5.2f km/hr",
        i, distance, time, speed
    )
    print(msg)
}

distance <- 5  # km
times <- c("00:18:47", "00:19:03", "00:19:20", "00:19:54")
for (i in seq_along(times)) {
    time_to_speed(i, times[i], distance)
}

## [1] "Runner 1 ran 5 km in 00:18:47 at a speed of 15.97 km/hr"
## [1] "Runner 2 ran 5 km in 00:19:03 at a speed of 15.75 km/hr"
## [1] "Runner 3 ran 5 km in 00:19:20 at a speed of 15.52 km/hr"
## [1] "Runner 4 ran 5 km in 00:19:54 at a speed of 15.08 km/hr"

You have the same output with no repetition. This is the power of a function. Note how it has been created:

The function is assigned to the name time_to_speed which hints at the fact that you are converting time into speed
The word function is used
Round brackets are used to list the arguments or inputs to the function, in this case:
- i: which number runner it is
- time: that runner’s time as a string in hr:min:sec format
- distance: the length of the race in km
The steps performed by the function are detailed between curly brackets

You could clean the code up even further by moving the function into a separate file and importing it as if it were a library.

2 ‘Apply’ Functions to Data

If you give a column of a data frame to a function, how will the function know whether you want it to act on the column as a whole (ie treat the column as one thing) or on every individual item in the column in turn (ie treat the column as multiple things)? What if you have a multi-step function with many if-else statements determining what happens? One way to deal with complicated situations like this is to define a custom function and then apply it to an input. Here’s an example that uses a subset of the built-in iris dataset:

# Create the data to be used in this example
df = iris[c(1, 2, 3, 51, 52, 53, 101, 102, 103), ]
print(df)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica

Now, here’s a function that finds flowers of the ‘virginica’ species:

find_virginica <- function(val) {
    if (val == "virginica") {
        return(TRUE)
    } else {
        return(FALSE)
    }
}

This function can be applied to the ‘Species’ row of our data frame by using the sapply() function. The ‘s’ in ‘sapply’ is short for “string”, which indicates that this is the type of data our function can be applied to:

virginica_flowers = sapply(df[["Species"]], find_virginica)
print(virginica_flowers)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

It has correctly identified that only the last three rows refer to virginica flowers. Now, here’s sapply being used on multiple columns at once:

for (colname in colnames(df)) {
    df[[colname]] <- sapply(df[[colname]], find_virginica)
}
print(df)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          FALSE       FALSE        FALSE       FALSE   FALSE
## 2          FALSE       FALSE        FALSE       FALSE   FALSE
## 3          FALSE       FALSE        FALSE       FALSE   FALSE
## 51         FALSE       FALSE        FALSE       FALSE   FALSE
## 52         FALSE       FALSE        FALSE       FALSE   FALSE
## 53         FALSE       FALSE        FALSE       FALSE   FALSE
## 101        FALSE       FALSE        FALSE       FALSE    TRUE
## 102        FALSE       FALSE        FALSE       FALSE    TRUE
## 103        FALSE       FALSE        FALSE       FALSE    TRUE

The find_virginica() function has found all the instances of virginica flowers in the table by looking at each row in each column in turn.

3 Summary

Functions are units of code that perform a specific task, such as a statistical test, the plotting of a graph or the conversion of time into speed
Many functions are included in base R, many more are available through importing libraries and even more are possible when created by the user.
Advantages of using functions are that you avoid the repetition of your own code and you can use someone else’s code without having to re-invent it.
The cleanest code will have functions in separate files and then import them into the main script.
Complex functions can be applied to data.

⇦ Back