
The Wilcoxon signed-rank test is a non-parametric test of the hypothesis that the differences between paired data points from two samples are symmetrically distributed about zero. In Python, it can be performed with the wilcoxon() function from SciPy.

Example 1

This example is taken from the SciPy documentation, which in turn took it from Wilcoxon's original 1945 paper in which the test was introduced. The differences in height between the cross- and self-fertilised corn plants in each of 15 pairs are as follows:

# Difference in height between pairs of corn plants
diffs = [6, 8, 14, 16, 23, 24, 28, 29, 41, -48, 49, 56, 60, -67, 75]

The fact that most of these differences are positive suggests that the cross-fertilised plants are generally taller. Let’s see if this observation is statistically significant:

from scipy import stats

statistic, pvalue = stats.wilcoxon(diffs)

print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 24.0, p = 0.041

The p-value is less than 0.05, so we would reject the null hypothesis at a significance level of 5% and conclude that there is a difference in height between the two groups of corn plants.

Note that the test statistic is represented by a capital T in SciPy's source code.

Example 2

This example comes from the Wikipedia page on the Wilcoxon signed-rank test:

group_1 = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
group_2 = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]

statistic, pvalue = stats.wilcoxon(group_1, group_2)

print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.633

It’s worth noting that we get the same answer regardless of the order of the groups:

# Swapping group 1 and 2
statistic, pvalue = stats.wilcoxon(group_2, group_1)

print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.633

We also get the same answer if we just use the differences between the values instead of the values themselves:

import numpy as np

diffs = np.array(group_2) - np.array(group_1)
statistic, pvalue = stats.wilcoxon(diffs)

print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.633

Discrepancies

Confusingly, these answers do not match those given by Wikipedia’s worked example. Wikipedia's test statistic (referred to as \(W\)) has a value of 9, whereas SciPy’s test statistic (referred to as \(T\)) has a value of 18. The p-values are also different: 0.6113 vs 0.633.

The DATAtab page about the Wilcoxon signed-rank test mentions that there are different ways to calculate the test statistic from the sum of the ranks of the positive differences (\(T^+\)) and the sum of the ranks of the negative differences (\(T^-\)). It appears that SciPy uses the minimum of the two whereas Wikipedia uses the sum of the signed ranks, which is the difference between the two:

  • SciPy: \(T = \min(T^+,\ T^-)\)
  • Wikipedia: \(W = T^+ - T^-\)
  • DATAtab: \(T = T^+\)
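
These three statistics carry the same information. As a brief sketch (with \(n\) denoting the number of non-zero differences, a symbol introduced here and not taken from the pages cited above): the ranks 1 to \(n\) are shared out between the positive and the negative differences, so the two rank sums always add up to the same total:

\[
T^+ + T^- = \frac{n(n+1)}{2}
\quad\Longrightarrow\quad
W = T^+ - T^- = 2T^+ - \frac{n(n+1)}{2},
\qquad
\min(T^+,\ T^-) = \min\left(T^+,\ \frac{n(n+1)}{2} - T^+\right)
\]

Any one of them can therefore be recovered from another once \(n\) is known (up to the sign of \(W\) when starting from the minimum), which is why the choice of statistic should not, by itself, change a two-sided p-value.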

In Example 1, the two negative differences that appear in the data (-48 and -67) rank 10th and 14th in magnitude out of the 15 differences respectively. Thus \(T^- = 10 + 14 = 24\), which is smaller than \(T^+\) and is therefore used as the value of the test statistic \(T\). Similarly, in Example 2 (taking the differences as group_2 minus group_1 and discarding the one pair with a difference of zero, which leaves nine ranked differences), we have \(T^- = 3 + 4 + 5 + 6 = 18\) and \(T^+ = 1.5 + 1.5 + 7 + 8 + 9 = 27\), so SciPy uses \(T = T^- = 18\) and Wikipedia uses \(W = 27 - 18 = 9\) for the test statistic.
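
The rank sums quoted above can be checked by hand. The following is a minimal sketch, not SciPy's own implementation: signed_rank_sums() is a hypothetical helper written for this illustration, and it assumes that zero differences are discarded and that tied magnitudes share their average rank.

```python
import numpy as np
from scipy import stats


def signed_rank_sums(diffs):
    """Return (T+, T-) for a sequence of paired differences.

    Hypothetical helper for illustration only; it is not part of SciPy.
    """
    d = np.asarray(diffs, dtype=float)
    d = d[d != 0]  # discard zero differences (assumption stated above)
    ranks = stats.rankdata(np.abs(d))  # tied magnitudes get the average rank
    t_plus = ranks[d > 0].sum()
    t_minus = ranks[d < 0].sum()
    return float(t_plus), float(t_minus)


# Example 1
corn_diffs = [6, 8, 14, 16, 23, 24, 28, 29, 41, -48, 49, 56, 60, -67, 75]
t_plus_1, t_minus_1 = signed_rank_sums(corn_diffs)
print(f'Example 1: T+ = {t_plus_1}, T- = {t_minus_1}, min = {min(t_plus_1, t_minus_1)}')
## Example 1: T+ = 96.0, T- = 24.0, min = 24.0

# Example 2 (differences taken as group_2 - group_1)
group_1 = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
group_2 = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
t_plus_2, t_minus_2 = signed_rank_sums(np.array(group_2) - np.array(group_1))
print(f'Example 2: T+ = {t_plus_2}, T- = {t_minus_2}, W = {t_plus_2 - t_minus_2}')
## Example 2: T+ = 27.0, T- = 18.0, W = 9.0
```

Swapping the order of the groups flips the sign of every difference, which swaps \(T^+\) and \(T^-\) but leaves their minimum unchanged. This is consistent with SciPy giving the same statistic for either order in Example 2.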

This explains the discrepancy in the values of the test statistics but not in the p-values. Both this answer on Stack Overflow and this online calculator suggest that the solution is \(p = 0.594\). Note that the online calculator uses \(W\) as its symbol for the test statistic but gives it a value of 18, which is inconsistent with both SciPy and Wikipedia, and that its answer of \(z = 0.5331\) needs to be converted to \(p = 0.594\) via stats.norm.sf(abs(-0.5331)) * 2. This result can be replicated with SciPy by using the “asymptotic” method for calculating the p-value:

statistic, pvalue = stats.wilcoxon(diffs, method='asymptotic')

print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.594
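
The asymptotic p-value can also be reproduced by hand. The sketch below uses the textbook normal approximation for the signed-rank statistic, with the usual correction to the variance for tied ranks and no continuity correction. That this is the calculation behind method='asymptotic' is an assumption made here, not something checked against SciPy's source. The values of n, T and tie_sizes are read off from the Example 2 ranks listed earlier.

```python
import numpy as np
from scipy import stats

# Values for Example 2, read off from the ranks listed in the text
n = 9            # number of non-zero differences
T = 18.0         # test statistic, min(T+, T-)
tie_sizes = [2]  # one group of two tied magnitudes (both share rank 1.5)

# Mean and variance of T under the null hypothesis (normal approximation)
mean_T = n * (n + 1) / 4
var_T = n * (n + 1) * (2 * n + 1) / 24
# Textbook tie correction: subtract (t^3 - t) / 48 for each group of t ties
var_T -= sum(t**3 - t for t in tie_sizes) / 48

z = (T - mean_T) / np.sqrt(var_T)
p = 2 * stats.norm.sf(abs(z))  # two-sided p-value

print(f'z = {z:.2f}, p = {p:.3f}')
## z = -0.53, p = 0.594
```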

This method='asymptotic' option is one of three provided by SciPy, along with method='exact' and method='auto'. In general, 'asymptotic' (which relies on a normal approximation) is better suited to large sample sizes while 'exact' is better suited to small ones, but the choice is also affected by the number of ties (multiple pairs of data points with the same numerical difference) and zeros (pairs of data points with the same value). Fortunately, the third option, 'auto', chooses between the two for you based on your data, and this is the default option.

In conclusion, it is sufficient to use SciPy’s implementation of the Wilcoxon signed-rank test as in the two examples above. It reports a different test statistic from the one used on the Wikipedia page, and its p-value depends on the method used to calculate it (exact or asymptotic), but the differences are small and the test is still valid.
