This page just talks about how to plot receiver operating characteristic (ROC) curves. For more on what they are and how to use them, see this tutorial. Briefly, a ROC curve illustrates how the diagnostic accuracy (ie the sensitivity and specificity) of a binary classification test (eg predicting whether something did or didn’t happen) changes as its threshold changes.
A good data set to use is the training data subset from the titanic
package. This contains information about the passengers on board the Titanic, how much they paid for their fare and whether or not they survived. Install this package in R with:
# Install the package
install.packages("titanic", repos = "http://cran.us.r-project.org")
Then load and take a look at the data:
# Load the package
library(titanic)
# Change the width of the console
options("width" = 200)
# Take a look at the first 6 rows of the data
head(titanic_train[c("Survived", "Pclass", "Name", "Sex", "Age", "Fare", "Cabin", "Embarked")])
## Survived Pclass Name Sex Age Fare Cabin Embarked
## 1 0 3 Braund, Mr. Owen Harris male 22 7.2500 S
## 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 71.2833 C85 C
## 3 1 3 Heikkinen, Miss. Laina female 26 7.9250 S
## 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 53.1000 C123 S
## 5 0 3 Allen, Mr. William Henry male 35 8.0500 S
## 6 0 3 Moran, Mr. James male NA 8.4583 Q
Fortunately for us, there is a function that can do most of the work when it comes to ROC analysis: the unsurprisingly named roc()
function. It comes with the pROC
library which you can install via the following:
# Install the package
install.packages("pROC", repos = "http://cran.us.r-project.org")
Then load and use:
library(pROC)
r <- roc(Survived ~ Fare, data = titanic_train)
Have a look at some of the thresholds (ie values for fare that will be taken as cut-offs) that this function produced:
head(r$thresholds)
## [1] -Inf 2.00625 4.50625 5.61875 6.33750 6.44375
library(ggplot2)
p <- ggroc(r)
p <- p + ggtitle("Receiver Operating Characteristic Curve")
p <- p + ylab("Sensitivity")
p <- p + xlab("Specificity")
p <- p + geom_segment(
aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed"
)
p <- p + scale_y_continuous(expand = c(0, 0))
p <- p + scale_x_reverse(expand = c(0, 0))
print(p)
Get the area under the curve and the 95% confidence interval of this area (which should be rounded off to 2 decimal places):
auc <- auc(r)
ci <- ci.auc(r)
ci_l <- round(ci[1], 2)
ci_u <- round(ci[3], 2)
Add this information to the graph:
p <- ggroc(r)
p <- p + ggtitle("Receiver Operating Characteristic Curve")
p <- p + ylab("Sensitivity")
p <- p + xlab("Specificity")
p <- p + geom_segment(
aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed"
)
p <- p + scale_y_continuous(expand = c(0, 0))
p <- p + scale_x_reverse(expand = c(0, 0))
legend_text <- paste0(
"AUC = ", round(auc, 2), " (95% CI = ", ci_l, " - ", ci_u, ")"
)
p <- p + annotate("text", x = 0.3, y = 0.05, label = legend_text)
print(p)
The area under the curve is 0.69; not particularly high! It seems that fare can’t be used as an accurate predictor of passenger survival.