When working with multiple groups of qualitative data the Tukey’s range test can be used to determine which of the groups’ means are significantly different from another group’s mean:

In the above plot the black dashed lines represent the mean values of the groups. Is the mean of Group 1 significantly different from that of Group 2? And from that of Group 3? How about the mean values of Group 2 and Group 3?

Now, one option for how to address these questions could be to use multiple two-sample t-tests (one test for each pair of groups) - see here for how to do that. However, the problem with this is that every time you do a t-test there is a chance that the result will be a false positive. This is also known as a Type I error and the probability of this occurring is usually set by the researcher at 5%. This means that, if we were to do three t-tests, the chance of this happening would be \(1 - (1 - 0.05)^3 = 14.3\%\) which is very high! There’s an xkcd comic that highlights this problem.

For this reason, whenever there are more than two groups it is expected that an ANOVA (analysis of variance) test is used instead of multiple t-tests as this will keep the Type I error rate at 5% overall. See here for the page on that topic. While this will answer the question of whether or not all groups’ means are the same it will not, unfortunately, tell us which means are different.

This is where Tukey’s range test becomes useful. It should be done after an ANOVA test has returned a significant result in order to tell us which groups are different.

1 Python Packages

The code on this page uses the Statsmodels, Matplotlib, Seaborn and SciPy packages. These can be installed from the terminal with the following commands:

# `python3.11` corresponds to the version of Python you have installed and are using
$ python3.11 -m pip install statsmodels
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install seaborn
$ python3.11 -m pip install scipy

Once finished, import these packages into your Python script as follows:

from statsmodels import api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from matplotlib import pyplot as plt
from matplotlib import lines
import seaborn as sns
from scipy import stats as st

2 Example Data

For this tutorial we will use the Travel Mode Choice dataset from Statsmodels.

For documentation on how to work with Statsmodels datasets, see here
For documentation on this dataset in particular, see here
For an example, see this page

This can be loaded into Python as a ‘Dataset’ object. This will contain the data itself as well as metadata about the dataset, but for this example we are only interested in the data. Furthermore, we want it to be in the Pandas data frame format, hence we will use the .load_pandas() method:

# Load the dataset
dataset = sm.datasets.modechoice.load_pandas()
# Extract the data
df = dataset['data']

print(df.head())

##    individual  mode  choice  ttme  invc   invt    gc  hinc  psize
## 0         1.0   1.0     0.0  69.0  59.0  100.0  70.0  35.0    1.0
## 1         1.0   2.0     0.0  34.0  31.0  372.0  71.0  35.0    1.0
## 2         1.0   3.0     0.0  35.0  25.0  417.0  70.0  35.0    1.0
## 3         1.0   4.0     1.0   0.0  10.0  180.0  30.0  35.0    1.0
## 4         2.0   1.0     0.0  64.0  58.0   68.0  68.0  30.0    2.0

The mode column contains the mode of transport used in various trips between Sydney, Canberra and Melbourne in Australia. Currently these are encoded as the numbers 1, 2, 3 and 4, representing ‘air’, ‘train’, ‘bus’ and ‘car’ respectively. We are only going to use the latter three of these groups for this tutorial, so we will decode and filter the data accordingly:

# Decode the mode of transport used
df['mode'] = df['mode'].replace({1: 'Air', 2: 'Train', 3: 'Bus', 4: 'Car'})
# Filter
df = df[df['mode'] != 'Air']

Finally, we are going to look only at the travel time variable in this example - held in the invt (in-vehicle time) column - so let’s extract that:

# Extract the columns we want
cols = ['invt', 'mode']
df = df[cols]

print(df.head())

##     invt   mode
## 1  372.0  Train
## 2  417.0    Bus
## 3  180.0    Car
## 5  354.0  Train
## 6  399.0    Bus

Here is all the data relevant to this example presented in a box plot plus a strip plot:

# Plot
ax = plt.axes()
sns.boxplot(
   df, x='mode', y='invt', color='lightgrey', whis=[0, 100],
   showmeans=True, meanline=True, meanprops={'color': 'black'}
)
sns.stripplot(
   df, x='mode', y='invt',
   color='lightgrey', edgecolor='black', linewidth=1
)
ax.set_title('Travel Mode Choice')
ax.set_ylabel('Travel time (in-vehicle time) for all stages (minutes)')
ax.set_ylim([0, 1500])
ax.set_xlabel('')
handles = [lines.Line2D([0], [0], color='k', linewidth=1, linestyle='--')]
ax.legend(handles, ['Group Means'])

3 Assumptions

Tukey’s range test (along with most t-tests and ANOVA tests) makes the following assumptions:

The observations are independent of each other both within a group and between groups. For our example:
- We can assume that the time taken to complete a journey does not influence the time taken to complete a different journey using the same mode of transport
- We can assume that the time taken to complete a train journey does not affect the time taken to complete a bus journey or a car journey, and vice versa
The data in each group is normally distributed, although if the sample size is large enough then non-normal data can be approximated as normal (given that there are not influential outliers and that the data is not considerably skewed). Rules-of-thumb for what is ‘large enough’ are:
- \(n > 20\) if you only have one group
- \(n > 15\) in each group if you have 2 to 9 groups
- \(n > 20\) in each group if you have 10 to 12 groups
The spread of the data should be roughly equal for all groups. This is known as homogeneity of variance although you can actually look at either variance or standard deviation to assess this. Rules-of-thumb as to what standard deviations \(s_0\), \(s_1\) can be considered to be ‘roughly equal’ are:
- \(0.5 < \frac{s_0}{s_1} < 2\) (source: Wikipedia)
- \(0.9 < \frac{s_0}{s_1} < 1.1\) (source: Codecademy)

Let’s check our sample size:

# Sample sizes
n = df['mode'].value_counts()

print(n)

## mode
## Train    210
## Bus      210
## Car      210
## Name: count, dtype: int64

With 210 samples in each group, only a few outliers and not much skewness (the means and medians of the groups are close to each other) we can assume normality with limited error.

Now let’s check the ratios of the standard deviations:

# Homogeneity of variance (using sample standard deviations)
s = df.groupby('mode')['invt'].std(ddof=1)
for i in range(3):
    for j in range(i + 1, 3):
        ratio = s[i] / s[j]
        print(f'Ratio of standard deviations: {ratio:.2f}')

## Ratio of standard deviations: 0.86
## Ratio of standard deviations: 0.93
## Ratio of standard deviations: 1.09

These are all between 0.5 and 2 (and are almost all between 0.9 and 1.1).

4 Choosing a Statistical Test

We can follow a statistical test selection process:

Our dependent variable - travel time - is continuous because it is made up of measurements that are numbers as opposed to discrete categories or ranks
Our independent variables - which mode of transport was used - is categorical (specifically, it is nominal)
We have three groups for the independent variable because there are three modes of transport under consideration
Given that the assumptions discussed in the previous section hold we are fine to use parametric statistics

Looking at the flowchart below tells us that we should use ANOVA and, if this produces a significant result, Tukey’s range test as well:

5 One-Way ANOVA Using an F-Test

Now that we’ve chosen a hypothesis test we can go ahead and perform it. Firstly, split the data up into one series of data per group:

# Samples
sample_0 = df[df['mode'] == 'Train']['invt']
sample_1 = df[df['mode'] == 'Bus']['invt']
sample_2 = df[df['mode'] == 'Car']['invt']

Then use the one-way ANOVA F-test from SciPy (the “F-test” is the actual test being performed in the background):

# One-way ANOVA
f_statistic, p_value = st.f_oneway(sample_0, sample_1, sample_2)

print(f'One-way ANOVA: F = {f_statistic:.2f}, p = {p_value:.2e}')

## One-way ANOVA: F = 2.62, p = 7.38e-02

We have a significant result (p < 0.05) so we should do a Tukey’s range test as well:

6 Tukey’s Range Test

We will use an alpha value of 10% (there’s no particular reason for this and 5% is more common):

# Tukey's range test
tukey_results = pairwise_tukeyhsd(df['invt'], df['mode'], 0.10)

print(tukey_results)

##  Multiple Comparison of Means - Tukey HSD, FWER=0.10  
## ======================================================
## group1 group2 meandiff p-adj    lower    upper  reject
## ------------------------------------------------------
##    Bus    Car -56.2571 0.0616 -107.3298 -5.1845   True
##    Bus  Train -21.1762 0.6704  -72.2488 29.8965  False
##    Car  Train   35.081 0.3351  -15.9917 86.1536  False
## ------------------------------------------------------

This output shows us that we can reject the null hypothesis that bus and car journeys take the same time, but that we cannot reject this hypothesis for bus and train journeys or for car and train journeys.

⇦ Back