The Mann–Whitney U test can be used to compare the medians of two groups of numerical data which are independent of each other.

The code on this page uses the numpy, matplotlib and scipy packages. These can be installed from the terminal with:

$ python3.11 -m pip install numpy
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install scipy

where python3.11 corresponds to the version of Python you have installed and are using.

1 Example Data

This example is taken from Section 12.4 of Probability and Statistics for Engineers and Scientists by S M Ross (4th edition, 2009). A fictional experiment designed to test the effectiveness of two different anti-corrosion treatments for metal yielded the following results:

  • Treatment 1: 65.2, 67.1, 69.4, 78.2, 74, 80.3
  • Treatment 2: 59.4, 72.1, 68, 66.2, 58.5

where the values represent the maximum depth of pits worn into the metal in thousandths of a centimetre. This can be hard-coded as a dictionary of lists in Python as follows:

# Raw data
data = {
    'Treatment 1': [65.2, 67.1, 69.4, 78.2, 74, 80.3],
    'Treatment 2': [59.4, 72.1, 68, 66.2, 58.5]
}
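Before plotting, it can be useful to eyeball each group's summary statistics. This quick sketch is not part of Ross's original example; it just uses the standard library's statistics module on the data above:

```python
import statistics

# Raw data
data = {
    'Treatment 1': [65.2, 67.1, 69.4, 78.2, 74, 80.3],
    'Treatment 2': [59.4, 72.1, 68, 66.2, 58.5]
}

# Sample size, mean and median for each treatment group
for name, values in data.items():
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f'{name}: n = {len(values)}, mean = {mean:.2f}, median = {median:.2f}')
```

## Treatment 1: n = 6, mean = 72.37, median = 71.70
## Treatment 2: n = 5, mean = 64.84, median = 66.20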

Here’s what the data looks like graphically:

import matplotlib.pyplot as plt
import numpy as np

#
# Plot
#
# Boxplot
meanprops = {'color': 'b'}
bp = plt.boxplot(data.values(), showmeans=True, meanline=True, meanprops=meanprops)
# Scatterplot for treatment 1
y = data['Treatment 1']
plt.scatter(np.zeros(len(y)) + 1, y, c='k')
# Scatterplot for treatment 2
y = data['Treatment 2']
plt.scatter(np.ones(len(y)) + 1, y, c='k')
# Title and labels
plt.title('The Effectiveness of Anti-Corrosion Treatments')
plt.xlabel('Treatment')
plt.ylabel(r'Maximum depth of pits [$10^{-3}$ cm]')
# Legend
plt.legend([bp['medians'][0], bp['means'][0]], ['Medians', 'Means'])

plt.show()

2 Choose a Statistical Test

Looking at the data plotted above, the question we want to ask is:

Is the average amount of wear different between samples given different anti-corrosion treatments?

If it is, it would suggest that the effectiveness of the treatments is different.

2.1 Parametric vs Non-Parametric

Before we can answer this question, we need to decide if we should use parametric or non-parametric statistics. Parametric statistics are based on assumptions about the data’s distribution whereas non-parametric statistics are not. Reasons to choose parametric statistics include:

  • Your data is normally distributed
  • Your data is not normally distributed but the sample size is large:
    • n > 20 if you only have one group
    • n > 15 in each group if you have 2 to 9 groups
    • n > 20 in each group if you have 10 to 12 groups
  • The spread of data in each group is different
  • There is a need to increase the statistical power (parametric tests usually have more power and so are more likely to detect a significant difference when one truly exists)

Conversely, reasons to choose non-parametric statistics include:

  • The sample size is small
  • The data is skewed (in other words, the median is more representative of the central tendency than the mean)

As the sample size in this example dataset is small, we should choose to do non-parametric statistics and tweak our research question so that it becomes “is the median amount of wear different?”1. In other words, are the orange lines in the plot above (the medians) at different heights?

1When doing parametric statistics you should use the mean; when doing non-parametric statistics you should use the median.
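As an aside, normality can also be checked formally rather than judged from the sample size alone. The sketch below (not part of Ross’s example) uses SciPy’s shapiro() function to run a Shapiro–Wilk test on each group; note that with samples this small the test has very little power, so a non-significant result would not demonstrate normality:

```python
from scipy.stats import shapiro

# Raw data
data = {
    'Treatment 1': [65.2, 67.1, 69.4, 78.2, 74, 80.3],
    'Treatment 2': [59.4, 72.1, 68, 66.2, 58.5]
}

# Shapiro-Wilk test for normality, one group at a time
for name, values in data.items():
    statistic, pvalue = shapiro(values)
    print(f'{name}: W = {statistic:.3f}, p = {pvalue:.3f}')
```

A small p-value here would be evidence against normality (and hence another reason to go non-parametric); a large one is simply inconclusive at these sample sizes.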

2.2 Set a Hypothesis

In order to answer the research question (“is the median amount of wear different?”) we now need to formulate it properly as a null hypothesis with a corresponding alternate hypothesis:

  • H0: true median amount of wear is the same for both treatments
  • H1: true median amount of wear is not the same for both treatments

2.3 Choose a Test

Now that we have hypotheses, we can follow a statistical test selection process:

  • Our data is continuous because it is made up of measurements that are numbers (not discrete categories or ranks)
  • We are interested in the difference between the amount of wear in one group compared to the other (not the relationship between the amount of wear and something else)
  • Specifically, we are interested in the difference of the average (median) amount of wear (not the variance in the amount of wear)
  • We have two groups because there are two treatments being compared
  • As discussed above, we are doing non-parametric statistics
  • Our measurements are independent of each other: the amount that one sample corrodes does not affect how much another sample corrodes

Looking at the flowchart below tells us that we should thus be using the Mann-Whitney U test to test our hypothesis:

3 Mann-Whitney U Test

The SciPy package makes it incredibly easy to perform the test: just use the mannwhitneyu() function!

from scipy.stats import mannwhitneyu

# Perform the Mann-Whitney U test
statistic, pvalue = mannwhitneyu(data['Treatment 1'], data['Treatment 2'])

print(f'Mann-Whitney U test: U = {int(statistic)}, p = {pvalue:6.4f}')
## Mann-Whitney U test: U = 24, p = 0.1255
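By default, mannwhitneyu() performs a two-sided test and chooses between an exact calculation and a normal approximation automatically. Both choices can be made explicit via the alternative and method keyword arguments (method requires a reasonably recent version of SciPy). This is an optional extra, not part of the original example:

```python
from scipy.stats import mannwhitneyu

treatment_1 = [65.2, 67.1, 69.4, 78.2, 74, 80.3]
treatment_2 = [59.4, 72.1, 68, 66.2, 58.5]

# Two-sided test (the default), explicitly using the exact method
statistic, pvalue = mannwhitneyu(treatment_1, treatment_2, alternative='two-sided', method='exact')
print(f'Two-sided: U = {int(statistic)}, p = {pvalue:6.4f}')

# One-sided test: is the wear under Treatment 1 greater than under Treatment 2?
statistic, pvalue = mannwhitneyu(treatment_1, treatment_2, alternative='greater')
print(f'One-sided: U = {int(statistic)}, p = {pvalue:6.4f}')
```

With no ties and samples this small, the exact two-sided p-value is simply double the one-sided one.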

3.1 Interpreting the Result

This p-value is large (> 0.05), which means that we fail to reject H0 and conclude that there is not a statistically significant difference between the two treatments.

Often, you will see this type of result reported using asterisks to indicate the significance level (α) associated with it. Additionally, if the p-value is very small it’s usually just reported as “<0.001” rather than as an exact value. Here are functions to add in this formatting (not that it’s relevant in this example!):

def get_significance(p):
    """Returns the significance of a p-values as a string of stars."""
    if p <= 0.001:
        return ' (***)'
    elif p <= 0.01:
        return ' (**)'
    elif p <= 0.05:
        return ' (*)'
    elif p <= 0.1:
        return ' (.)'
    else:
        return ''


def round_p_value(p):
    """Round a small p-value so that it is human-readable."""
    if p < 0.001:
        return '<0.001'
    else:
        return f'{p:5.3}'


p_rounded = round_p_value(pvalue)
significance = get_significance(pvalue)
print(f'The p-value is {p_rounded}{significance}')
## The p-value is 0.126

4 Can’t Get Enough?

This page showed one method of performing the Mann-Whitney test which, in my opinion, is the best for practical use. However, if you’re interested in doing a deep dive into how the test works and into variations on the method that SciPy’s mannwhitneyu() uses by default, take a look here.
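For a taste of what mannwhitneyu() does under the hood, the U statistic can be calculated by hand from rank sums. This sketch (using SciPy’s rankdata() function; not part of the original example) reproduces the U = 24 found above:

```python
from scipy.stats import rankdata

treatment_1 = [65.2, 67.1, 69.4, 78.2, 74, 80.3]
treatment_2 = [59.4, 72.1, 68, 66.2, 58.5]

# Rank all observations together (mid-ranks would be assigned to any ties)
ranks = rankdata(treatment_1 + treatment_2)
n1 = len(treatment_1)

# Sum of the ranks belonging to the first group
r1 = ranks[:n1].sum()

# U statistic for the first group
u1 = r1 - n1 * (n1 + 1) / 2
print(f'R1 = {int(r1)}, U = {int(u1)}')
```

## R1 = 45, U = 24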

If you’re interested in adding significance bars to boxplots after having used the Mann-Whitney test to find a significant difference, look over here.
