⇦ Back

This page just talks about how to plot receiver operating characteristic (ROC) curves. For more on what they are and how to use them, see this tutorial. Briefly, a ROC curve illustrates how the diagnostic accuracy (ie the sensitivity and specificity) of a binary classification test (eg predicting whether something did or didn’t happen) changes as its threshold changes.

1 Example Data

A good data set to use is the training data subset from the titanic package. This contains information about the passengers on board the Titanic, how much they paid for their fare and whether or not they survived. Install this package in R with:

# Install the package
install.packages("titanic", repos = "http://cran.us.r-project.org")

Then load and take a look at the data:

# Import the package
library(titanic)
# Change the width of the console
options("width" = 200)
# Take a look at the first 6 rows of the data
head(titanic_train[c("Survived", "Pclass", "Name", "Sex", "Age", "Fare", "Cabin", "Embarked")])
##   Survived Pclass                                                Name    Sex Age    Fare Cabin Embarked
## 1        0      3                             Braund, Mr. Owen Harris   male  22  7.2500              S
## 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38 71.2833   C85        C
## 3        1      3                              Heikkinen, Miss. Laina female  26  7.9250              S
## 4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35 53.1000  C123        S
## 5        0      3                            Allen, Mr. William Henry   male  35  8.0500              S
## 6        0      3                                    Moran, Mr. James   male  NA  8.4583              Q

2 Calculating Thresholds

Fortunately for us, there is a function that can do most of the work when it comes to ROC analysis: the unsurprisingly named roc() function. It comes with the pROC library which you can install via the following:

# Install the package
install.packages("pROC", repos = "http://cran.us.r-project.org")

Then load and use:

library(pROC)

r <- roc(Survived ~ Fare, data = titanic_train)

Have a look at some of the thresholds (ie values for fare that will be taken as cut-offs) that this function produced:

head(r$thresholds)
## [1]    -Inf 2.00625 4.50625 5.61875 6.33750 6.44375

3 Plot

plot(r, type = "S")

Get the area under the curve and the 95% confidence interval of this area (which should be rounded off to 2 decimal places):

auc <- auc(r)
ci <- ci.auc(r)
ci_l <- round(ci[1], 2)
ci_u <- round(ci[3], 2)

Add this information to the graph:

plot(r, type = "S")
legend_text <- paste0(
    "AUC = ", round(auc, 2), " (95% CI = ", ci_l, " - ", ci_u, ")"
)
legend("bottomright", legend = legend_text, pch = 15)

The area under the curve is 0.69; not particularly high! It seems that fare can’t be used as an accurate predictor of passenger survival.

⇦ Back