When working with multiple groups of qualitative data the Tukey’s range test can be used to determine which of the groups’ means are significantly different from another group’s mean:
In the above plot the black dashed lines represent the mean values of the groups. Is the mean of Group 1 significantly different from that of Group 2? And from that of Group 3? How about the mean values of Group 2 and Group 3?
Now, one option for how to address these questions could be to use multiple two-sample t-tests (one test for each pair of groups) - see here for how to do that. However, the problem with this is that every time you do a t-test there is a chance that the result will be a false positive. This is also known as a Type I error and the probability of this occurring is usually set by the researcher at 5%. This means that, if we were to do three t-tests, the chance of this happening would be \(1 - (1 - 0.05)^3 = 14.3\%\) which is very high! There’s an xkcd comic that highlights this problem.
For this reason, whenever there are more than two groups it is expected that an ANOVA (analysis of variance) test is used instead of multiple t-tests as this will keep the Type I error rate at 5% overall. See here for the page on that topic. While this will answer the question of whether or not all groups’ means are the same it will not, unfortunately, tell us which means are different.
This is where Tukey’s range test becomes useful. It should be done after an ANOVA test has returned a significant result in order to tell us which groups are different.
The code on this page uses the Statsmodels, Matplotlib, Seaborn and SciPy packages. These can be installed from the terminal with the following commands:
# `python3.11` corresponds to the version of Python you have installed and are using
$ python3.11 -m pip install statsmodels
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install seaborn
$ python3.11 -m pip install scipy
Once finished, import these packages into your Python script as follows:
from statsmodels import api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from matplotlib import pyplot as plt
from matplotlib import lines
import seaborn as sns
from scipy import stats as st
For this tutorial we will use the Travel Mode Choice dataset from Statsmodels.
This can be loaded into Python as a ‘Dataset’ object. This will contain the data itself as well as metadata about the dataset, but for this example we are only interested in the data. Furthermore, we want it to be in the Pandas data frame format, hence we will use the .load_pandas()
method:
# Load the dataset
dataset = sm.datasets.modechoice.load_pandas()
# Extract the data
df = dataset['data']
print(df.head())
## individual mode choice ttme invc invt gc hinc psize
## 0 1.0 1.0 0.0 69.0 59.0 100.0 70.0 35.0 1.0
## 1 1.0 2.0 0.0 34.0 31.0 372.0 71.0 35.0 1.0
## 2 1.0 3.0 0.0 35.0 25.0 417.0 70.0 35.0 1.0
## 3 1.0 4.0 1.0 0.0 10.0 180.0 30.0 35.0 1.0
## 4 2.0 1.0 0.0 64.0 58.0 68.0 68.0 30.0 2.0
The mode
column contains the mode of transport used in various trips between Sydney, Canberra and Melbourne in Australia. Currently these are encoded as the numbers 1, 2, 3 and 4, representing ‘air’, ‘train’, ‘bus’ and ‘car’ respectively. We are only going to use the latter three of these groups for this tutorial, so we will decode and filter the data accordingly:
# Decode the mode of transport used
df['mode'] = df['mode'].replace({1: 'Air', 2: 'Train', 3: 'Bus', 4: 'Car'})
# Filter
df = df[df['mode'] != 'Air']
Finally, we are going to look only at the travel time variable in this example - held in the invt
(in-vehicle time) column - so let’s extract that:
# Extract the columns we want
cols = ['invt', 'mode']
df = df[cols]
print(df.head())
## invt mode
## 1 372.0 Train
## 2 417.0 Bus
## 3 180.0 Car
## 5 354.0 Train
## 6 399.0 Bus
Here is all the data relevant to this example presented in a box plot plus a strip plot:
# Plot
ax = plt.axes()
sns.boxplot(
df, x='mode', y='invt', color='lightgrey', whis=[0, 100],
showmeans=True, meanline=True, meanprops={'color': 'black'}
)
sns.stripplot(
df, x='mode', y='invt',
color='lightgrey', edgecolor='black', linewidth=1
)
ax.set_title('Travel Mode Choice')
ax.set_ylabel('Travel time (in-vehicle time) for all stages (minutes)')
ax.set_ylim([0, 1500])
ax.set_xlabel('')
handles = [lines.Line2D([0], [0], color='k', linewidth=1, linestyle='--')]
ax.legend(handles, ['Group Means'])
Tukey’s range test (along with most t-tests and ANOVA tests) makes the following assumptions:
Let’s check our sample size:
# Sample sizes
n = df['mode'].value_counts()
print(n)
## mode
## Train 210
## Bus 210
## Car 210
## Name: count, dtype: int64
With 210 samples in each group, only a few outliers and not much skewness (the means and medians of the groups are close to each other) we can assume normality with limited error.
Now let’s check the ratios of the standard deviations:
# Homogeneity of variance (using sample standard deviations)
s = df.groupby('mode')['invt'].std(ddof=1)
for i in range(3):
for j in range(i + 1, 3):
ratio = s[i] / s[j]
print(f'Ratio of standard deviations: {ratio:.2f}')
## Ratio of standard deviations: 0.86
## Ratio of standard deviations: 0.93
## Ratio of standard deviations: 1.09
These are all between 0.5 and 2 (and are almost all between 0.9 and 1.1).
We can follow a statistical test selection process:
Looking at the flowchart below tells us that we should use ANOVA and, if this produces a significant result, Tukey’s range test as well:
Now that we’ve chosen a hypothesis test we can go ahead and perform it. Firstly, split the data up into one series of data per group:
# Samples
sample_0 = df[df['mode'] == 'Train']['invt']
sample_1 = df[df['mode'] == 'Bus']['invt']
sample_2 = df[df['mode'] == 'Car']['invt']
Then use the one-way ANOVA F-test from SciPy (the “F-test” is the actual test being performed in the background):
# One-way ANOVA
f_statistic, p_value = st.f_oneway(sample_0, sample_1, sample_2)
print(f'One-way ANOVA: F = {f_statistic:.2f}, p = {p_value:.2e}')
## One-way ANOVA: F = 2.62, p = 7.38e-02
We have a significant result (p < 0.05) so we should do a Tukey’s range test as well:
We will use an alpha value of 10% (there’s no particular reason for this and 5% is more common):
# Tukey's range test
tukey_results = pairwise_tukeyhsd(df['invt'], df['mode'], 0.10)
print(tukey_results)
## Multiple Comparison of Means - Tukey HSD, FWER=0.10
## ======================================================
## group1 group2 meandiff p-adj lower upper reject
## ------------------------------------------------------
## Bus Car -56.2571 0.0616 -107.3298 -5.1845 True
## Bus Train -21.1762 0.6704 -72.2488 29.8965 False
## Car Train 35.081 0.3351 -15.9917 86.1536 False
## ------------------------------------------------------
This output shows us that we can reject the null hypothesis that bus and car journeys take the same time, but that we cannot reject this hypothesis for bus and train journeys or for car and train journeys.