The Wilcoxon signed-rank test is a non-parametric test of the hypothesis that the differences between the values of pairs of data points from two samples are symmetric about zero. In Python, it can be performed with the wilcoxon() function from SciPy.
This example is taken from the SciPy documentation, which in turn took it from Wilcoxon's original (1945) paper in which the test was introduced. The differences in height between 15 pairs of cross- and self-fertilised corn plants are as follows:
# Difference in height between pairs of corn plants
diffs = [6, 8, 14, 16, 23, 24, 28, 29, 41, -48, 49, 56, 60, -67, 75]
The fact that most of these differences are positive indicates that the cross-fertilised plants are generally taller. Let’s see if this observation is statistically significant:
from scipy import stats
statistic, pvalue = stats.wilcoxon(diffs)
print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 24.0, p = 0.041
The p-value is less than 0.05, so we reject the null hypothesis at the 5% significance level and conclude that there is a difference in height between the two groups of corn plants.
Note that the test statistic is represented by a capital T in the source code.
This example comes from the Wikipedia page on the Wilcoxon signed-rank test:
group_1 = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
group_2 = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
statistic, pvalue = stats.wilcoxon(group_1, group_2)
print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.633
It’s worth noting that we get the same answer regardless of the order of the groups:
# Swapping group 1 and 2
statistic, pvalue = stats.wilcoxon(group_2, group_1)
print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.633
We also get the same answer if we just use the differences between the values instead of the values themselves:
import numpy as np
diffs = np.array(group_2) - np.array(group_1)
statistic, pvalue = stats.wilcoxon(diffs)
print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.633
Confusingly, these answers do not match those given by Wikipedia’s worked example: its test statistic (referred to as \(W\)) has a value of 9, compared to SciPy’s test statistic (referred to as \(T\)), which has a value of 18. The p-values also differ: 0.6113 vs 0.633.
The DATAtab page about the Wilcoxon signed-rank test mentions that there are different ways to calculate the test statistic using the sum of the positive ranks (\(T^+\)) and the sum of the negative ranks (\(T^-\)). It appears that SciPy is using the minimum of the two whereas Wikipedia is using the (signed) difference between the two:
In Example 1, the two negative differences that appear in the data (-48 and -67) rank 10th and 14th in magnitude out of the 15 differences respectively, thus \(T^- = 10 + 14 = 24\) which is smaller than \(T^+\) and which is thus used as the value of the test statistic \(T\). Similarly, in Example 2, we have \(T^- = 3 + 4 + 5 + 6 = 18\) and \(T^+ = 1.5 + 1.5 + 7 + 8 + 9 = 27\) so SciPy uses \(T = T^- = 18\) and Wikipedia uses \(W = 27 - 18 = 9\) for the test statistic.
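To make this concrete, here is a short sketch that recomputes \(T^+\) and \(T^-\) for the Example 2 data by hand using scipy.stats.rankdata. The zero difference is dropped first, which matches SciPy's default zero_method='wilcox':

```python
import numpy as np
from scipy import stats

group_1 = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
group_2 = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
diffs = np.array(group_2) - np.array(group_1)

# Drop the zero differences (SciPy's default zero_method='wilcox')
nonzero = diffs[diffs != 0]
# Rank the absolute differences; ties receive averaged ranks
ranks = stats.rankdata(np.abs(nonzero))

t_plus = ranks[nonzero > 0].sum()
t_minus = ranks[nonzero < 0].sum()
print(f'T+ = {t_plus}, T- = {t_minus}')      # T+ = 27.0, T- = 18.0
print(f'SciPy: T = {min(t_plus, t_minus)}')  # SciPy: T = 18.0
print(f'Wikipedia: W = {t_plus - t_minus}')  # Wikipedia: W = 9.0
```

This reproduces both statistics: SciPy's \(T = \min(T^+, T^-) = 18\) and Wikipedia's \(W = T^+ - T^- = 9\).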
This explains the discrepancy in the values of the test statistics but not in the p-values. Both this answer on Stack Overflow and this online calculator suggest that the solution is \(p = 0.594\) (note that the online calculator uses \(W\) as its symbol for the test statistic but still with a value of 18, inconsistent with both SciPy and Wikipedia, and that its answer of \(z = 0.5331\) needs to be converted to \(p = 0.594\) via
stats.norm.sf(abs(-0.5331)) * 2
). This result can be replicated with SciPy by using the “asymptotic” method for calculating the p-value:
statistic, pvalue = stats.wilcoxon(diffs, method='asymptotic')
print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.594
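For completeness, the asymptotic p-value can be reconstructed by hand. The sketch below uses the standard normal approximation for the signed-rank statistic with a correction for tied ranks, which, to my understanding, is what SciPy applies by default (with correction=False, i.e. no continuity correction):

```python
import numpy as np
from scipy import stats

# The non-zero paired differences from Example 2 (the zero is dropped)
diffs = np.array([15, -7, 5, 20, -9, 17, -12, 5, -10])
n = len(diffs)

ranks = stats.rankdata(np.abs(diffs))
T = min(ranks[diffs > 0].sum(), ranks[diffs < 0].sum())

# Mean and variance of T under the null hypothesis
mean = n * (n + 1) / 4
var = n * (n + 1) * (2 * n + 1) / 24
# Correction for tied ranks: subtract sum(t^3 - t) / 48 over tie groups
_, counts = stats.find_repeats(ranks)
var -= (counts**3 - counts).sum() / 48

z = (T - mean) / np.sqrt(var)
p = 2 * stats.norm.sf(abs(z))
print(f'z = {z:.4f}, p = {p:.3f}')
```

This gives \(p = 0.594\), matching the asymptotic result above. (The tie correction slightly shifts \(z\) relative to the online calculator's 0.5331, which appears not to correct for ties, but the p-value rounds to the same 0.594.)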
This method='asymptotic' option is one of three provided by SciPy, along with method='exact' and method='auto'. In general, 'asymptotic' is better for large sample sizes while 'exact' is better for small sample sizes, but the choice is also affected by the number of ties (multiple pairs of data points with the same numerical difference) and zeros (pairs of data points with the same value). Fortunately, the third option, 'auto', automatically chooses the better of the two methods for your data, and it is the default.
In conclusion, it is sufficient to use SciPy’s implementation of the Wilcoxon signed-rank test, as in the two examples above. It uses a slightly different test statistic, and thus yields a slightly different p-value, than the method used on the Wikipedia page, but the difference is small and the method is still valid.