The Mann–Whitney U test is a non-parametric test that can be used to compare two independent groups of numerical data. It asks whether values in one group tend to be larger than values in the other, which is often framed as a comparison of the groups' medians.
The code on this page uses the numpy, matplotlib and scipy packages. These can be installed from the terminal with:
$ python3.11 -m pip install numpy
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install scipy
where python3.11 corresponds to the version of Python you have installed and are using.
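If you want to confirm that the packages installed correctly (and check which versions you have), a quick optional sanity check is to import them and print their version strings; nothing else on this page depends on it:
import numpy
import scipy
import matplotlib

# Print the installed version of each package
print(numpy.__version__, scipy.__version__, matplotlib.__version__)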
This example is taken from Section 12.4 of Probability and Statistics for Engineers and Scientists by S M Ross (4th edition, 2009). A fictional experiment designed to test the effectiveness of two different anti-corrosion treatments for metal yielded the results shown below, where the values represent the maximum depth of pits worn into the metal in thousandths of a centimetre. These can be hard-coded as a dictionary of lists in Python as follows:
# Raw data
data = {
'Treatment 1': [65.2, 67.1, 69.4, 78.2, 74, 80.3],
'Treatment 2': [59.4, 72.1, 68, 66.2, 58.5]
}
Here’s what the data looks like graphically:
import matplotlib.pyplot as plt
import numpy as np
#
# Plot
#
# Boxplot
meanprops = {'color': 'b'}
bp = plt.boxplot(data.values(), showmeans=True, meanline=True, meanprops=meanprops)
# Scatterplot for treatment 1
y = data['Treatment 1']
plt.scatter(np.zeros(len(y)) + 1, y, c='k')
# Scatterplot for treatment 2
y = data['Treatment 2']
plt.scatter(np.ones(len(y)) + 1, y, c='k')
# Title and labels
plt.title('The Effectiveness of Anti-Corrosion Treatments')
plt.xlabel('Treatment')
plt.ylabel(r'Maximum depth of pits [$10^{-3}$ cm]')
# Legend
plt.legend([bp['medians'][0], bp['means'][0]], ['Medians', 'Means'])

plt.show()
Looking at the data plotted above, the question we want to ask is:
Is the average amount of wear different between samples given different anti-corrosion treatments?
If it is, it would suggest that the effectiveness of the treatments is different.
Before we can answer this question, we need to decide if we should use parametric or non-parametric statistics. Parametric statistics are based on assumptions about the data’s distribution (whereas non-parametric statistics are not) and reasons to choose them include having a large sample size and data that is approximately normally distributed.
As the sample size in this example dataset is small, we should choose to do non-parametric statistics and tweak our research question so that it becomes “is the median amount of wear different?”¹. In other words, are the orange lines in the plot above (the medians) at different heights?
¹ When doing parametric statistics you should use the mean; when doing non-parametric statistics you should use the median.
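As a quick sanity check (a minimal sketch re-using the data dictionary defined above) we can print each group’s size and median, confirming that the samples are indeed small and showing the values the hypotheses below refer to:
import numpy as np

# Group sizes and medians, using the data dictionary defined above
for name, values in data.items():
    print(f'{name}: n = {len(values)}, median = {np.median(values):.1f}')

## Treatment 1: n = 6, median = 71.7
## Treatment 2: n = 5, median = 66.2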
In order to answer the research question (“is the median amount of wear different?”) we now need to formulate it properly as a null hypothesis with a corresponding alternate hypothesis:
- Null hypothesis (H0): there is no difference in the median amount of wear between the two treatment groups
- Alternate hypothesis (H1): there is a difference in the median amount of wear between the two treatment groups
Now that we have hypotheses, we can follow a statistical test selection process:
Looking at the flowchart below tells us that we should thus be using the Mann-Whitney U test to test our hypothesis:
The SciPy package makes it incredibly easy to perform the test: just use the mannwhitneyu() function!
from scipy.stats import mannwhitneyu
# Perform the Mann-Whitney U test
statistic, pvalue = mannwhitneyu(data['Treatment 1'], data['Treatment 2'])
print(f'Mann-Whitney U test: U = {int(statistic)}, p = {pvalue:6.4f}')
## Mann-Whitney U test: U = 24, p = 0.1255
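For reference, the call above relies on SciPy’s defaults. Assuming a reasonably recent version of SciPy (1.7 or later), the equivalent call with those defaults written out explicitly looks like this:
# The same test with the default arguments spelled out (assumes SciPy >= 1.7)
statistic, pvalue = mannwhitneyu(
    data['Treatment 1'], data['Treatment 2'],
    alternative='two-sided',  # look for a difference in either direction
    method='auto',            # exact distribution for small samples without ties
)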
This p-value is large (p > 0.05), which means that we fail to reject H0 and conclude that there is no statistically significant difference between the two treatments.
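If you want to make that decision explicit in code, a minimal sketch (assuming the conventional significance level of 0.05) could look like this:
# Compare the p-value against the chosen significance level
alpha = 0.05
if pvalue < alpha:
    print('Reject H0: the median wear differs between the treatments')
else:
    print('Fail to reject H0: no statistically significant difference detected')

## Fail to reject H0: no statistically significant difference detected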
Often, you will see this type of result reported using asterisks to indicate the significance level (α) associated with it. Additionally, if the p-value is very small it’s usually just reported as “<0.001” rather than as an exact value. Here are functions that add this formatting (not that it’s relevant in this example, where the result is not significant!):
def get_significance(p):
    """Return the significance of a p-value as a string of stars."""
    if p <= 0.001:
        return ' (***)'
    elif p <= 0.01:
        return ' (**)'
    elif p <= 0.05:
        return ' (*)'
    elif p <= 0.1:
        return ' (.)'
    else:
        return ''
def round_p_value(p):
    """Round a small p-value so that it is human-readable."""
    if p < 0.001:
        return '<0.001'
    else:
        return f'{p:5.3}'
p_rounded = round_p_value(pvalue)
significance = get_significance(pvalue)
print(f'The p-value is {p_rounded}{significance}')
## The p-value is 0.126
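Combining these helpers with the test statistic gives a one-line report in the style described above (the values follow from the outputs already shown):
# Report U together with the rounded p-value and its significance stars
print(f'Mann-Whitney U test: U = {int(statistic)}, p = {p_rounded}{significance}')

## Mann-Whitney U test: U = 24, p = 0.126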
This page showed one method of calculating the Mann-Whitney U test which, in my opinion, is the best for practical use. However, if you’re interested in doing a deep dive into how the test works and the variations on the method used by SciPy’s mannwhitneyu() by default, take a look here.
If you’re interested in adding significance bars to boxplots after having used the Mann-Whitney test to find a significant difference, look over here.