The code on this page uses the pyDataset, pandas, NumPy, scikit-learn and Matplotlib packages. These can be installed from the terminal with the following commands:
# Replace `python3.12` with the version of Python you have installed and are using
$ python3.12 -m pip install pydataset
$ python3.12 -m pip install pandas
$ python3.12 -m pip install numpy
$ python3.12 -m pip install scikit-learn
$ python3.12 -m pip install matplotlib
Once finished, import these packages into your Python script as follows:
from pydataset import data
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn import model_selection
from sklearn import metrics
from matplotlib import pyplot as plt
This tutorial will use the “titanic” dataset from pyDataset. It contains data about all* 1,316 non-crew passengers aboard the ill-fated Titanic on 14 April 1912, including what class of ticket they held, whether they were an adult or a child, whether they were male or female, and whether or not they survived:
# Import raw data
df = data('titanic')
print(df.head())
## class age sex survived
## 1 1st class adults man yes
## 2 1st class adults man yes
## 3 1st class adults man yes
## 4 1st class adults man yes
## 5 1st class adults man yes
*There is uncertainty about the exact passenger list and hence the exact number of people onboard, but this dataset is consistent with a British Board of Trade report on the disaster from the time.
Note that pyDataset actually contains three datasets with the Titanic’s passenger survival data. There is also a “Lifeboats” dataset which contains information about the lifeboats on the Titanic.
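If you want to check which Titanic-related datasets pyDataset provides, calling data() with no arguments returns a data frame of the available datasets (this sketch assumes it has ‘dataset_id’ and ‘title’ columns, which is how pyDataset documents its listing):
# List the Titanic-related datasets available in pyDataset
available = data()
mask = available['dataset_id'].str.contains('titanic', case=False)
print(available[mask])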
Let’s double-check that we indeed have 1,316 data points:
# Sample size
total = df.shape[0]
print(total)
## 1316
For reasons that will become clear later, we will first analyse the overall survival of passengers, then the survival of passengers by sex and then the survival of passengers by sex and class. For now, the important thing to notice is the increase in complexity of the analysis with each additional variable. Starting with overall survival, we can count how many passengers died and how many survived by filtering the `survived` column:
died = len([v for v in df['survived'] if v == 'no'])
survived = len([v for v in df['survived'] if v == 'yes'])
print(f'Died: {died}, survived: {survived}')
## Died: 817, survived: 499
As a percentage:
survived_100 = survived / total * 100
print(f'Percentage survived: {survived_100:.1f}%')
## Percentage survived: 37.9%
The `.value_counts()` method can achieve the same as filtering:
survived = df['survived'].value_counts()
print(survived)
## survived
## no 817
## yes 499
## Name: count, dtype: int64
As percentages:
survived_100 = survived / total * 100
print(survived_100)
## survived
## no 62.082067
## yes 37.917933
## Name: count, dtype: float64
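As an aside, `.value_counts()` can also return proportions directly via its normalize parameter, which skips the division by the total:
# Aside: proportions straight from .value_counts()
survived_100 = df['survived'].value_counts(normalize=True) * 100
print(survived_100)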
So, when only dealing with one variable - `survived` - it doesn’t really make a difference which method you use and the code is relatively short either way.
Now let’s break the dataset down by sex. First, let’s see how many men and women there were:
counts = df['sex'].value_counts()
print(counts)
## sex
## man 869
## women 447
## Name: count, dtype: int64
As percentages:
counts_100 = counts / total * 100
print(counts_100)
## sex
## man 66.033435
## women 33.966565
## Name: count, dtype: float64
How many of these men and women survived?
Using ‘manual’ filtering is one option for breaking the dataset up by the `survived` and `sex` variables:
mask = (df['survived'] == 'no') & (df['sex'] == 'women')
women_died = df[mask].shape[0]
mask = (df['survived'] == 'yes') & (df['sex'] == 'women')
women_survived = df[mask].shape[0]
women_survived_100 = women_survived / (women_survived + women_died)
women_died_100 = women_died / (women_survived + women_died)
print(f'Women survived: {women_survived} ({women_survived_100:.1%}), died: {women_died} ({women_died_100:.1%})')
## Women survived: 324 (72.5%), died: 123 (27.5%)
mask = (df['survived'] == 'no') & (df['sex'] == 'man')
men_died = df[mask].shape[0]
mask = (df['survived'] == 'yes') & (df['sex'] == 'man')
men_survived = df[mask].shape[0]
men_died_100 = men_died / (men_survived + men_died)
men_survived_100 = men_survived / (men_survived + men_died)
print(f'Men survived: {men_survived} ({men_survived_100:.1%}), died: {men_died} ({men_died_100:.1%})')
## Men survived: 175 (20.1%), died: 694 (79.9%)
All of a sudden we need to write a lot of lines of code just to separate the data by one dependent and one independent variable! Let’s have a look at other methods:
Using the `.groupby()` method means we only need to write two lines of code to do the equivalent of the above, but the problem is that we get percentage survival rates relative to the total number of passengers, not relative to the number of men or women:
grouped = df.groupby(['sex', 'survived']).count().iloc[:, 0]
print(grouped)
## sex survived
## man no 694
## yes 175
## women no 123
## yes 324
## Name: class, dtype: int64
percentages = grouped / total * 100
print(percentages)
## sex survived
## man no 52.735562
## yes 13.297872
## women no 9.346505
## yes 24.620061
## Name: class, dtype: float64
Another two lines can fix that:
women_survived_100 = grouped['women'] / df['sex'].value_counts()['women'] * 100
print(women_survived_100)
## survived
## no 27.516779
## yes 72.483221
## Name: class, dtype: float64
men_survived_100 = grouped['man'] / df['sex'].value_counts()['man'] * 100
print(men_survived_100)
## survived
## no 79.86191
## yes 20.13809
## Name: class, dtype: float64
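For completeness, a one-liner that gives the within-group percentages directly is to combine `.groupby()` with `.value_counts(normalize=True)` (an aside; the rest of the tutorial does not rely on it):
# Aside: within-group percentages in one line
within_group_100 = df.groupby('sex')['survived'].value_counts(normalize=True) * 100
print(within_group_100)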
Creating a pivot table works but is clumsy because you end up with two columns per additional variable (‘age’ and ‘class’ in this case):
pt = pd.pivot_table(df, index='sex', columns='survived', aggfunc='count')
print(pt)
## age class
## survived no yes no yes
## sex
## man 694 175 694 175
## women 123 324 123 324
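One way to tidy the pivot table up (an aside, not needed for what follows) is to restrict the count to a single column via the values parameter:
# Aside: counting a single column avoids the duplicated sets of columns
pt = pd.pivot_table(df, index='sex', columns='survived', values='class', aggfunc='count')
print(pt)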
Creating a cross tabulation (a contingency table) will get you the raw numbers (as opposed to the percentages) in one line:
ct = pd.crosstab(df['sex'], df['survived'])
print(ct)
## survived no yes
## sex
## man 694 175
## women 123 324
Now to get the percentages. One option is to normalise the cross tabulation (using the `normalize` parameter) but this, again, gives you the percentages of the total number of passengers:
ct_norm = pd.crosstab(df['sex'], df['survived'], normalize=True) * 100
print(ct_norm)
## survived no yes
## sex
## man 52.735562 13.297872
## women 9.346505 24.620061
You can use NumPy vectors to get the values that you want, but this is tricky because you need to remember to transpose:
percentages = ct.values / np.array([ct.sum(axis=1).values]).T * 100
print(percentages)
## [[79.86191024 20.13808976]
## [27.51677852 72.48322148]]
The easiest is probably just to keep using pandas:
percentages = ct.divide(ct.sum(axis=1), axis=0) * 100
print(percentages)
## survived no yes
## sex
## man 79.861910 20.138090
## women 27.516779 72.483221
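It is also worth knowing that `pd.crosstab()` can row-normalise in one step via normalize='index' (an aside that reproduces the same percentages as above):
# Aside: row-normalised cross tabulation
ct_row = pd.crosstab(df['sex'], df['survived'], normalize='index') * 100
print(ct_row)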
Now if we were to add in an extra variable - `class` - using manual filtering to analyse the data would take forever! Using what we learnt above we can still break it all down in just two lines:
ct = pd.crosstab([df['sex'], df['class']], df['survived'])
print(ct)
## survived no yes
## sex class
## man 1st class 118 62
## 2nd class 154 25
## 3rd class 422 88
## women 1st class 4 141
## 2nd class 13 93
## 3rd class 106 90
percentages = ct.divide(ct.sum(axis=1), axis=0) * 100
print(percentages)
## survived no yes
## sex class
## man 1st class 65.555556 34.444444
## 2nd class 86.033520 13.966480
## 3rd class 82.745098 17.254902
## women 1st class 2.758621 97.241379
## 2nd class 12.264151 87.735849
## 3rd class 54.081633 45.918367
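These percentages are easier to take in visually; as an optional aside (not part of the original outputs), the ‘yes’ column can be plotted as a bar chart using the matplotlib import from earlier:
# Aside: bar chart of the survival percentages by sex and class
percentages['yes'].plot(kind='bar')
plt.ylabel('Percentage survived [%]')
plt.tight_layout()
plt.show()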
Using a decision tree will essentially take the above information one step further: given details about a passenger and the survival rates calculated above we can classify that passenger as either a probable survivor or a probable non-survivor.
Firstly, we will re-format the data by separating out the target (dependent) variable from the predictor (independent) variables and converting the latter into binary values using the `pd.get_dummies()` function:
# Set the target and predictor variables
X = pd.get_dummies(df.iloc[:, :-1])
y = df['survived']
print(X.head())
## class_1st class class_2nd class class_3rd class age_adults age_child sex_man sex_women
## 1 True False False True False True False
## 2 True False False True False True False
## 3 True False False True False True False
## 4 True False False True False True False
## 5 True False False True False True False
print(y.head())
## 1 yes
## 2 yes
## 3 yes
## 4 yes
## 5 yes
## Name: survived, dtype: object
For the moment, we will use all of this data to train the decision tree:
X_train = X
y_train = y
Use `DecisionTreeClassifier()` from scikit-learn’s `tree` sub-module to create the decision tree:
# Create the model
model = tree.DecisionTreeClassifier(
max_depth=3, ccp_alpha=0.01, random_state=20230808
)
model.fit(X_train, y_train)
- `max_depth` sets the maximum number of splits a path through the decision tree will encounter
- `ccp_alpha` is used for cost-complexity pruning. The default is to not do any pruning.
- `random_state` sets the seed for the random number generator used in the background. Setting this ensures we get the same result each time we run this code.

Let’s visualise the tree by using the `plot_tree()` function:
# Plot
tree.plot_tree(
model, feature_names=list(X_train),
class_names=['Died', 'Survived'], filled=True
)
plt.tight_layout()
plt.show()
It’s not immediately obvious what a lot of the information in this diagram means, so let’s start at the root node (the topmost box):
- `samples = 1316` reflects the fact that there were 1,316 non-crew passengers on the Titanic, ie that’s the number of rows in our data frame/the number of data points we have
- `value = [817, 499]` reflects the fact that 817 passengers died while 499 survived. We calculated these numbers earlier in this tutorial.
- `class = Died` means that more people died than survived - at this point our decision tree is classifying everyone as having died
- `sex_women <= 0.5` is saying that, if we want to predict the fate of a particular passenger and the `sex_women` value for that passenger is less than or equal to 0.5 (remember that we used the `pd.get_dummies()` function to convert all variables into binary values, so `sex_women = True = 1` means that the passenger was a woman while `sex_women = False = 0` means that they were a man) then we should follow the left-hand path. In other words, if the passenger is a man we should go left and if they are a woman we should go right.
- `gini = 0.471` is the value of the Gini impurity (\(I_G\)) for the data at this node: we currently have a ‘mixture’ of passengers who survived and died whereas our goal is to separate out those who survived from those who died. Doing so would reduce the Gini impurity: we would have less of a mixture and more of a pure substance (to use a chemistry analogy). More specifically, this value is calculated from the following formula:

\[
\begin{aligned}
I_G &= 1 - (p_1^2 + p_2^2) \\
    &= 1 - (p_1^2 + (1 - p_1)^2)
\end{aligned}
\]
where \(p_1\), \(p_2\) are the proportions of passengers who survived and died. This can be calculated in Python as follows:
# Confirm Gini impurity for the root node
value = [817, 499]
gini = 1 - sum([(x / sum(value))**2 for x in value])
print(f'Gini impurity, IG = {gini:.3f}')
## Gini impurity, IG = 0.471
or:
# Confirm Gini impurity for the root node
gini = 1 - sum(y_train.value_counts(normalize=True)**2)
print(f'Gini impurity, IG = {gini:.3f}')
## Gini impurity, IG = 0.471
A lot of the other values we calculated earlier in the tutorial can be read off from the decision tree: looking at the left-hand side box we see that there were 869 men, 694 of whom died and 175 of whom survived. Likewise, we can look at the box in the middle on the right-hand side and see that there were 447 women, 123 of whom died and 324 of whom survived. The second split on the right-hand side branch classifies women passengers according to whether or not they had 3rd class tickets: 196 did, of whom 106 died and 90 survived. These numbers were calculated at various stages of the Initial Data Analysis section.
So, in summary, the decision tree is splitting the data up into groups and classifying the passengers in each group as either having survived or died depending on the proportion of actual outcomes for that group. Unfortunately, this particular tree is not very good: it suggests that all men and third class women died and that the remainder (1st and 2nd class women) all survived. Let’s see if we can improve this a bit:
Let’s create a brand new tree, but this time separate the data out into a training and a testing set (70% for training and 30% for testing) instead of using 100% for training:
# Split into training and testing sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(
X, y, random_state=20230808, test_size=0.3
)
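As an aside, `train_test_split()` also accepts a stratify parameter which keeps the died/survived proportions similar in the training and testing sets; it is not used here so that the outputs below stay as shown:
# Aside: a stratified split (not used in this tutorial)
X_train_s, X_test_s, y_train_s, y_test_s = model_selection.train_test_split(
    X, y, random_state=20230808, test_size=0.3, stratify=y
)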
Now we can create it as we did previously:
# Create the model
model = tree.DecisionTreeClassifier(
max_depth=3, ccp_alpha=0.01, random_state=20230808
)
model.fit(X_train, y_train)
Does the tree look different?
# Plot
tree.plot_tree(
model, feature_names=list(X_train),
class_names=['Died', 'Survived'], filled=True
)
plt.tight_layout()
plt.show()
The first split is still on `sex`, but the left-hand branch now splits on `age` and then the right-hand side of that splits on `class`. The right-hand branch is unchanged (the exact numbers are different because we are only using 70% of the data, but the split is still on `class`).
Scikit-learn’s `tree` sub-module also has an `export_text()` function that will show the tree as text:
# Visualise
text_tree = tree.export_text(
model, feature_names=X_train.columns, class_names=['Died', 'Survived']
)
print(text_tree)
## |--- sex_women <= 0.50
## | |--- age_child <= 0.50
## | | |--- class: Died
## | |--- age_child > 0.50
## | | |--- class_3rd class <= 0.50
## | | | |--- class: Survived
## | | |--- class_3rd class > 0.50
## | | | |--- class: Died
## |--- sex_women > 0.50
## | |--- class_3rd class <= 0.50
## | | |--- class: Survived
## | |--- class_3rd class > 0.50
## | | |--- class: Died
As has been done above, we can calculate the Gini impurity for a node (eg the root node):
# Confirm Gini impurity for the root node
value = [571, 350]
gini = 1 - sum([(x / sum(value))**2 for x in value])
print(f'Gini impurity, IG = {gini:.3f}')
## Gini impurity, IG = 0.471
or:
# Confirm Gini impurity for the root node
gini = 1 - sum(y_train.value_counts(normalize=True)**2)
print(f'Gini impurity, IG = {gini:.3f}')
## Gini impurity, IG = 0.471
The whole point of having splits in the decision tree is to help classify data points, and the degree to which a split works in this regard can be quantified by calculating the amount of ‘information’ we gain by performing it. This is equivalent to the reduction in (weighted) impurity achieved by a split.
#
# Calculate information gain of first split
#
left = y_train[X_train['sex_women'] <= 0.5]
right = y_train[X_train['sex_women'] > 0.5]
# Weighted Gini impurity = sum([samples / n * gini for each node])
# Weight for Gini impurity of the left branch
w = float(len(left)) / (len(left) + len(right))
# Gini impurity of the left branch
gini_left = 1 - sum(left.value_counts(normalize=True)**2)
# Gini impurity of the right branch
gini_right = 1 - sum(right.value_counts(normalize=True)**2)
# Information gain =
# Reduction in impurity after a split =
# old weighted Gini impurity - new weighted Gini impurity =
# current_impurity - w * gini(left) - (1 - w) * gini(right)
info_gain = gini - w * gini_left - (1 - w) * gini_right
print(f'Information gain = {info_gain:.3f}')
## Information gain = 0.121
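As a cross-check, the fitted model records each node’s impurity and sample count in its `tree_` attribute, so the same information gain can be read off the tree itself (a sketch relying on scikit-learn’s tree internals: `children_left`, `children_right`, `impurity` and `n_node_samples`):
# Cross-check the information gain using the fitted tree's internal arrays
t = model.tree_
root = 0
left_node, right_node = t.children_left[root], t.children_right[root]
w_left = t.n_node_samples[left_node] / t.n_node_samples[root]
w_right = t.n_node_samples[right_node] / t.n_node_samples[root]
info_gain_check = (
    t.impurity[root]
    - w_left * t.impurity[left_node]
    - w_right * t.impurity[right_node]
)
print(f'Information gain = {info_gain_check:.3f}')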
So, is this tree any good? Let’s see how good it is at classifying the data in the testing set:
# Make predictions
y_pred = model.predict(X_test)
# Accuracy of the predictions
acc = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy = {acc:.2f}')
## Accuracy = 0.78
or:
# Accuracy of the predictions
acc = model.score(X_test, y_test)
print(f'Accuracy = {acc:.2f}')
## Accuracy = 0.78
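To see what a prediction looks like for an individual, we can also call `.predict()` on a single hypothetical passenger - here a 1st class adult woman, built using the dummy column names created by `pd.get_dummies()` above (an illustrative aside):
# Aside: predict the fate of one hypothetical passenger (a 1st class adult woman)
passenger = pd.DataFrame([{
    'class_1st class': True, 'class_2nd class': False, 'class_3rd class': False,
    'age_adults': True, 'age_child': False,
    'sex_man': False, 'sex_women': True,
}], columns=X_train.columns)
print(model.predict(passenger))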
Precision score (the ability of the classifier not to label as positive a sample that is negative):
# Precision score
precision_dtree = metrics.precision_score(y_test, y_pred, pos_label='yes')
print(f'Precision = {precision_dtree:.2f}')
## Precision = 0.93
F1 score (the harmonic mean of precision and recall):
# F1 score
f1_dtree = metrics.f1_score(y_test, y_pred, pos_label='yes')
print(f'F1 = {f1_dtree:.2f}')
## F1 = 0.62
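Since the F1 score is the harmonic mean of precision and recall, it can also be useful to look at the recall and the full confusion matrix (a sketch using the same scikit-learn `metrics` sub-module; these values were not part of the outputs above):
# Recall score (the ability of the classifier to find all the positive samples)
recall_dtree = metrics.recall_score(y_test, y_pred, pos_label='yes')
print(f'Recall = {recall_dtree:.2f}')

# Confusion matrix (rows are actual classes, columns are predicted classes)
cm = metrics.confusion_matrix(y_test, y_pred, labels=['no', 'yes'])
print(cm)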