Remove any values you have stored from previous runs.
rm(list = ls())
By default, RStudio sets your working directory (ie the folder everything gets imported from, exported to and run from) to your home directory. This will be:
/Users/<yourname>/ on macOS
/home/<yourname>/ on Linux
C:\Users\ on Windows
Change this to be the same folder as the one your script is saved in:
# Note: the $ofile trick below only works when the script is run via source();
# it will fail if the code is run line-by-line in the console.
setwd(dirname(parent.frame(2)$ofile))
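If the parent.frame() trick doesn't work in your setup, an alternative (assumption: you are running inside RStudio and have the rstudioapi package installed) is to ask RStudio directly for the path of the open script:

```r
# Sketch: set the working directory to the folder of the script currently
# open in RStudio. The guards make this a no-op outside RStudio, so the
# lines are safe to leave in a script.
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
}
```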
For this tutorial you will need the ‘titanic’, ‘knitr’, ‘ggplot2’ and ‘pROC’ packages.
The data will come from ‘titanic’; see its documentation here: https://www.rdocumentation.org/packages/titanic/versions/0.1.0
When you want to use functionality that is part of a particular package, you need to install the package (once) and then load it with library() in the script:
install.packages("titanic", repos = "http://cran.us.r-project.org")
install.packages("knitr", repos = "http://cran.us.r-project.org")
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("pROC", repos = "http://cran.us.r-project.org")
library(titanic)
library(knitr)
library(ggplot2)
library(pROC)
The titanic dataset is actually split into two: a ‘training’ subset and a ‘test’ subset. We’re going to be working with the training subset, which is called titanic_train.
When previewing the data, don’t look at all of it at once because you don’t know how much there is. If there is a massive amount it might crash RStudio! Instead, only look at the first few rows to get a general idea. This can be done by using the head() function:
print(head(titanic_train))
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 1 20 1 3 Masselmani, Mrs. Fatima female NA 0 0 2649 7.2250 C
## 2 21 0 2 Fynney, Mr. Joseph J male 35 0 0 239865 26.0000 S
## 3 22 1 2 Beesley, Mr. Lawrence male 34 0 0 248698 13.0000 D56 S
## 4 23 1 3 McGowan, Miss. Anna "Annie" female 15 0 0 330923 8.0292 Q
## 5 24 1 1 Sloper, Mr. William Thompson male 28 0 0 113788 35.5000 A6 S
## 6 25 0 3 Palsson, Miss. Torborg Danira female 8 3 1 349909 21.0750 S
kable() is a table generator which comes from the knitr package. It can automatically convert tables to HTML, LaTeX or Markdown format, making it easy to add them to a report or a webpage.
print(kable(head(titanic_train)))
##
##
## | PassengerId| Survived| Pclass|Name |Sex | Age|Ticket | Fare|Cabin |Embarked |
## |-----------:|--------:|------:|:-----------------------------|:------|---:|:------|-------:|:-----|:--------|
## | 20| 1| 3|Masselmani, Mrs. Fatima |female | NA|2649 | 7.2250| |C |
## | 21| 0| 2|Fynney, Mr. Joseph J |male | 35|239865 | 26.0000| |S |
## | 22| 1| 2|Beesley, Mr. Lawrence |male | 34|248698 | 13.0000|D56 |S |
## | 23| 1| 3|McGowan, Miss. Anna "Annie" |female | 15|330923 | 8.0292| |Q |
## | 24| 1| 1|Sloper, Mr. William Thompson |male | 28|113788 | 35.5000|A6 |S |
## | 25| 0| 3|Palsson, Miss. Torborg Danira |female | 8|349909 | 21.0750| |S |
This step isn’t actually necessary, but for the sake of this tutorial let’s export the dataset to csv. You’ll then be able to open it in Excel or Calc and share or save it.
write.csv(titanic_train, "titanic_train.csv")
There should now be a file called “titanic_train.csv” in the same folder where this script is saved.
Again, this step isn’t necessary but for the sake of completeness let’s import the csv we just exported:
df <- read.csv("titanic_train.csv")
The data has been imported as a data frame and assigned to the variable df. A data frame is the fundamental data type in R; it’s essentially a table of data where each column has a column heading and each row (usually) only has a number (ie there are no row names, only row numbers). Have a look at the column headings here:
print(colnames(df))
## [1] "X" "PassengerId" "Survived" "Pclass" "Name" "Sex" "Age" "SibSp" "Parch" "Ticket" "Fare" "Cabin" "Embarked"
…and the number of rows here:
print(nrow(df))
## [1] 891
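Incidentally, the imported data frame has gained an extra ‘X’ column: write.csv() saves the row numbers as an unnamed first column by default, and read.csv() then imports them under the name ‘X’. A minimal sketch (using a temporary file, so nothing in your folder is touched) shows that passing row.names = FALSE when exporting avoids this:

```r
# Demonstrate where the "X" column comes from: write a tiny data frame
# twice, with and without row names, and compare the re-imported columns.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3), tmp)                     # default: row names kept
print(colnames(read.csv(tmp)))                          # "X" "a"
write.csv(data.frame(a = 1:3), tmp, row.names = FALSE)  # row names dropped
print(colnames(read.csv(tmp)))                          # "a"
```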
We can chop and change the data frame to look exactly how we want it:
Remove all columns except ‘Name’, ‘Sex’ and ‘Age’:
subset <- subset(df, select = c("Name", "Sex", "Age"))
print(head(subset))
## Name Sex Age
## 1 Braund, Mr. Owen Harris male 22
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38
## 3 Heikkinen, Miss. Laina female 26
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35
## 5 Allen, Mr. William Henry male 35
## 6 Moran, Mr. James male NA
Was there a Mr James Flynn onboard?
bool <- "Flynn, Mr. James" %in% df$"Name"
print(bool)
## [1] TRUE
In which row is Miss Joan Wells?
idx <- match("Wells, Miss. Joan", df$"Name")
print(idx)
## [1] 751
Let’s look at only the passengers in third class:
subset <- subset(df, Pclass == "3")
print(head(subset))
## X PassengerId Survived Pclass Name Sex Age SibSp Ticket Fare Cabin Embarked
## 1 51 51 0 3 Panula, Master. Juha Niilo male 7.0 4 3101295 39.6875 S
## 2 52 52 0 3 Nosworthy, Mr. Richard Cater male 21.0 0 A/4. 39886 7.8000 S
## 3 58 58 0 3 Novel, Mr. Mansouer male 28.5 0 2697 7.2292 C
## 4 60 60 0 3 Goodwin, Master. William Frederick male 11.0 5 CA 2144 46.9000 S
## 5 61 61 0 3 Sirayanian, Mr. Orsen male 22.0 0 2669 7.2292 C
## 6 64 64 0 3 Skoog, Master. Harald male 4.0 3 347088 27.9000 S
Now only the passengers between the ages of 20 and 30:
subset <- subset(df, Age >= 20 & Age < 30)
print(head(subset))
## X PassengerId Survived Pclass Name Sex Age SibSp Ticket Fare Cabin Embarked
## 1 73 73 0 2 Hood, Mr. Ambrose Jr male 21 0 S.O.C. 14879 73.5000 S
## 2 74 74 0 3 Chronopoulos, Mr. Apostolos male 26 1 2680 14.4542 C
## 3 76 76 0 3 Moen, Mr. Sigurd Hansen male 25 0 348123 7.6500 F G73 S
## 4 81 81 0 3 Waelens, Mr. Achille male 22 0 345767 9.0000 S
## 5 82 82 1 3 Sheerlinck, Mr. Jan Baptist male 29 0 345779 9.5000 S
## 6 84 84 0 1 Carrau, Mr. Francisco M male 28 0 113059 47.1000 S
How much was the cheapest, most expensive, average and median ticket?
min <- min(df$Fare)
max <- max(df$Fare)
mean <- mean(df$Fare)
median <- median(df$Fare)
print(min)
print(max)
print(mean)
print(median)
## [1] 0
## [1] 512.3292
## [1] 32.20421
## [1] 14.4542
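As an aside, base R’s summary() function returns several of these statistics in one call. A quick sketch on a toy vector (the same call works on df$Fare):

```r
# summary() reports the minimum, quartiles, median, mean and maximum at once.
fares <- c(0, 7.25, 14.45, 32.20, 512.33)
print(summary(fares))
```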
Let’s ask the following question: were those people who paid more for their ticket more likely to survive? It’s a bit hard to even guess at the answer just by scanning your eye down the table of numbers, so let’s try visualising the data first. This can be done with a box plot (aka a box-and-whisker plot):
boxplot(Fare~Survived, data = df)
The boxplot() function will create the graph for us, and the arguments we specified tell it what to do:
- The data keyword argument was set equal to df, as that is the data frame we want to use.
- The columns of df that we want to plot are Fare on the y-axis and Survived on the x-axis. R uses the tilde symbol (~) - which means ‘proportional to’ in statistics - to specify which variable is being plotted against which. Notice that the argument was Fare~Survived, ie the dependent variable was named first.
This figure can be made more attractive by customising the function’s settings, which won’t be discussed here. However, a quick way to get a better-looking graph would be to re-do it with the ggplot2 package:
df$"Survived"[df$"Survived" %in% 0] <- "Died"
df$"Survived"[df$"Survived" %in% 1] <- "Lived"
p <- ggplot(df, aes(Survived, Fare))
p <- p + geom_boxplot()
p <- p + ggtitle(
"Were the passengers more likely to survive if they had paid more?"
)
p <- p + xlab("")
print(p)
As you can see, this is more complicated but it arguably produces a nicer figure.
Is this difference significant? We can’t immediately assume that the data is Normally distributed, so let’s use a non-parametric method such as the Wilcoxon rank-sum test (aka the Mann-Whitney U test) to decide whether one group tends to be larger than the other (hint: in order to separate the two groups - Died and Lived - we will have to filter the data frame before performing the test):
died <- subset(df, Survived == "Died")$Fare
lived <- subset(df, Survived == "Lived")$Fare
test <- wilcox.test(died, lived, alternative = "two.sided")
print(test$p.value)
## [1] 4.553477e-22
That’s pretty convincing. Therefore, it’s safe to conclude that passengers who paid a higher fare were more likely to survive than those who paid a lower fare.
Now let’s ask a slightly different question: can a passenger’s fare price be used to predict whether or not they survived?
Let’s create a predictive test. Recall that the median ticket price paid by passengers was £14.45 (we calculated that using median(df$Fare)), so let’s use that as a cut-off and make the following prediction: create a new column called prediction that predicts whether each passenger survived or died, ie it categorises them into ‘predicted to survive’ and ‘predicted to die’ groups (hint: if the fare was larger than 14.45, the prediction is that they lived, else it’s that they died):
df$prediction <- ifelse(df$Fare > 14.45, "Lived", "Died")
print(head(df[c("Fare", "prediction")]))
## Fare prediction
## 1 7.2500 Died
## 2 71.2833 Lived
## 3 7.9250 Died
## 4 53.1000 Lived
## 5 8.0500 Died
## 6 8.4583 Died
The next steps aren’t strictly necessary, but they change the statistically ‘positive’ outcome of our test from ‘Died’ to ‘Lived’. Try commenting out the following four lines and running the code that follows this. Then uncomment them and run it again to see what change they make.
df$prediction <- factor(df$prediction)
df$prediction <- relevel(df$prediction, ref = "Lived")
df$survived <- factor(df$Survived)
df$survived <- relevel(df$survived, ref = "Lived")
We can investigate the accuracy of our test by creating a confusion matrix (https://en.wikipedia.org/wiki/Confusion_matrix) (hint: this is a table that compares a test with a ‘ground truth’ reference):
cm <- table(test = df$prediction, ref = df$survived)
print(cm)
## ref
## test Lived Died
## Lived 231 220
## Died 111 329
print(nrow(df))
## [1] 891
Some of our ‘positive’ predictions (that the passenger survived) came true while some were false. Some of our ‘negative’ predictions (that the passenger died) came true while some were false. Extract the number of true positives, false positives, false negatives and true negatives from the confusion matrix (hint: this will require 2-dimensional indexing because a confusion matrix has two dimensions):
tp <- cm[1, 1]
fp <- cm[1, 2]
fn <- cm[2, 1]
tn <- cm[2, 2]
print(tp)
print(fp)
print(fn)
print(tn)
## [1] 231
## [1] 220
## [1] 111
## [1] 329
We have all the information we need to calculate the four most important results of a confusion matrix investigation:
sens <- tp / (tp + fn)
spec <- tn / (tn + fp)
ppv <- tp / (tp + fp)
npv <- tn / (tn + fn)
print(sens)
print(spec)
print(ppv)
print(npv)
## [1] 0.6754386
## [1] 0.5992714
## [1] 0.5121951
## [1] 0.7477273
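Two further summary measures can be computed from the same four counts: accuracy (the proportion of all predictions that were correct) and the F1 score (the harmonic mean of sensitivity and the positive predictive value). A short sketch using the values printed above:

```r
# Using tp = 231, fp = 220, fn = 111, tn = 329 from the confusion matrix:
tp <- 231; fp <- 220; fn <- 111; tn <- 329
accuracy <- (tp + tn) / (tp + fp + fn + tn)  # correct predictions / all predictions
f1 <- 2 * tp / (2 * tp + fp + fn)            # harmonic mean of sens and PPV
print(round(accuracy, 4))  # 0.6285
print(round(f1, 4))        # 0.5826
```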
Unfortunately these values aren’t very good! Maybe it’s because the cut-off we chose (14.45) wasn’t good? We’ll have to investigate how the sensitivity and specificity change when we use different cut-offs.
A ROC curve illustrates how the diagnostic accuracy (ie the sensitivity and specificity) of a binary classification test (eg predicting whether a passenger survived or died) changes as its threshold changes. Fortunately for us, there is a function that can do most of the work for us (hint: this function calculates the receiver operating characteristic):
r <- roc(Survived~Fare, data = df)
Have a look at some of the thresholds (ie values for fare that will be taken as cut-offs) that this function produced:
print(head(r$thresholds))
## [1] -Inf 2.00625 4.50625 5.61875 6.33750 6.44375
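pROC can also report the single threshold that best balances sensitivity and specificity (by Youden’s index) via its coords() function. A sketch on toy data, since the best fare cut-off isn’t computed in this tutorial (the same call works on the r object above):

```r
library(pROC)

# Toy example: coords(..., "best") returns the threshold that maximises
# sensitivity + specificity (Youden's index). Here the two groups are
# perfectly separable, so the best threshold scores 100% on both.
outcome <- c(0, 0, 0, 1, 1, 1)
score <- c(1, 2, 2.4, 2.6, 4, 5)
r_toy <- roc(outcome, score, quiet = TRUE)
best <- coords(r_toy, "best", ret = c("threshold", "sensitivity", "specificity"))
print(best)
```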
plot(r, type = "S")
Get the area under the curve and the 95% confidence interval of this area (which should be rounded off to 2 decimal places):
auc <- auc(r)
ci <- ci.auc(r)
ci_l <- round(ci[1], 2)
ci_u <- round(ci[3], 2)
Add this information to the graph:
plot(r, type = "S")
legend_text <- paste0(
"AUC = ", round(auc, 2), " (95% CI = ", ci_l, " - ", ci_u, ")"
)
legend("bottomright", legend = legend_text, pch = 15)
A more polished version of this figure can again be produced with ggplot2, using pROC’s ggroc() function:
p <- ggroc(r)
p <- p + ggtitle("Receiver Operating Characteristic Curve")
p <- p + geom_segment(
aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed"
)
p <- p + scale_y_continuous(expand = c(0, 0))
p <- p + scale_x_reverse(expand = c(0, 0))
legend_text <- paste0(
"AUC = ", round(auc, 2), " (95% CI = ", ci_l, " - ", ci_u, ")"
)
p <- p + annotate("text", x = 0.3, y = 0.05, label = legend_text)
print(p)
The area under the curve is 0.69; not particularly high! It seems that - despite the fact that the Mann-Whitney U test was highly significant - fare can’t be used as an accurate predictor of passenger survival!