The Wilcoxon signed-rank test is a non-parametric test of the hypothesis that the differences between the values of pairs of data points from two samples are symmetric about zero. In Python, it can be performed with the wilcoxon() function from SciPy.
This example is taken from the SciPy documentation, which in turn took it from Wilcoxon's original (1945) paper in which the test was introduced. The differences in height between 15 pairs of cross- and self-fertilised corn plants are as follows:
# Difference in height between pairs of corn plants
diffs = [6, 8, 14, 16, 23, 24, 28, 29, 41, -48, 49, 56, 60, -67, 75]
The fact that most of these differences are positive indicates that the cross-fertilised plants are generally taller. Let’s see if this observation is statistically significant:
from scipy import stats
statistic, pvalue = stats.wilcoxon(diffs)
print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 24.0, p = 0.041
The p-value is less than 0.05, so we reject the null hypothesis at the 5% significance level and conclude that there is a difference in height between the two groups of corn plants.
Note that the test statistic is represented by a capital T in the source code.
This example comes from the Wikipedia page on the Wilcoxon signed-rank test:
group_1 = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
group_2 = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
statistic, pvalue = stats.wilcoxon(group_1, group_2)
print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.633
It’s worth noting that we get the same answer regardless of the order of the groups:
# Swapping group 1 and 2
statistic, pvalue = stats.wilcoxon(group_2, group_1)
print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.633
We also get the same answer if we just use the differences between the values instead of the values themselves:
import numpy as np
diffs = np.array(group_2) - np.array(group_1)
statistic, pvalue = stats.wilcoxon(diffs)
print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.633
Confusingly, these answers do not match those given by Wikipedia’s worked example: its test statistic (referred to as \(W\)) has a value of 9, compared to SciPy’s test statistic (referred to as \(T\)), which has a value of 18. The p-values also differ: 0.6113 vs 0.633.
The DATAtab page about the Wilcoxon signed-rank test mentions that there are different ways to calculate the test statistic using the sum of the positive ranks (\(T^+\)) and the sum of the negative ranks (\(T^-\)). It appears that SciPy is using the minimum of the two whereas Wikipedia is using the (signed) difference between the two:
In Example 1, the two negative differences that appear in the data (-48 and -67) rank 10th and 14th in magnitude out of the 15 differences respectively, thus \(T^- = 10 + 14 = 24\) which is smaller than \(T^+\) and which is thus used as the value of the test statistic \(T\). Similarly, in Example 2, we have \(T^- = 3 + 4 + 5 + 6 = 18\) and \(T^+ = 1.5 + 1.5 + 7 + 8 + 9 = 27\) so SciPy uses \(T = T^- = 18\) and Wikipedia uses \(W = 27 - 18 = 9\) for the test statistic.
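To make this concrete, here is a short sketch that recomputes \(T^+\) and \(T^-\) for the Example 2 data by hand using scipy.stats.rankdata. The zero difference is dropped first, which matches SciPy's default zero_method='wilcox':

```python
import numpy as np
from scipy import stats

group_1 = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
group_2 = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
diffs = np.array(group_2) - np.array(group_1)

# Drop the zero differences (SciPy's default zero_method='wilcox')
nonzero = diffs[diffs != 0]
# Rank the absolute differences; ties receive averaged ranks
ranks = stats.rankdata(np.abs(nonzero))

t_plus = ranks[nonzero > 0].sum()
t_minus = ranks[nonzero < 0].sum()
print(f'T+ = {t_plus}, T- = {t_minus}')      # T+ = 27.0, T- = 18.0
print(f'SciPy: T = {min(t_plus, t_minus)}')  # SciPy: T = 18.0
print(f'Wikipedia: W = {t_plus - t_minus}')  # Wikipedia: W = 9.0
```

This reproduces both statistics: SciPy's \(T = \min(T^+, T^-) = 18\) and Wikipedia's \(W = T^+ - T^- = 9\).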
This explains the discrepancy in the values of the test statistics but not in the p-values. Both this answer on Stack Overflow and this online calculator suggest that the solution is \(p = 0.594\) (note that the online calculator uses \(W\) as its symbol for the test statistic but still with a value of 18, inconsistent with both SciPy and Wikipedia, and that its answer of \(z = 0.5331\) needs to be converted to \(p = 0.594\) via
stats.norm.sf(abs(-0.5331)) * 2
). This result can be replicated with SciPy by using the “asymptotic” method for calculating the p-value:
statistic, pvalue = stats.wilcoxon(diffs, method='asymptotic')
print(f'Test statistic = {statistic}, p = {pvalue:.3f}')
## Test statistic = 18.0, p = 0.594
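For completeness, the asymptotic p-value can be reconstructed by hand. The sketch below uses the standard normal approximation for the signed-rank statistic with a correction for tied ranks, which, to my understanding, is what SciPy applies by default (with correction=False, i.e. no continuity correction):

```python
import numpy as np
from scipy import stats

# The non-zero paired differences from Example 2 (the zero is dropped)
diffs = np.array([15, -7, 5, 20, -9, 17, -12, 5, -10])
n = len(diffs)

ranks = stats.rankdata(np.abs(diffs))
T = min(ranks[diffs > 0].sum(), ranks[diffs < 0].sum())

# Mean and variance of T under the null hypothesis
mean = n * (n + 1) / 4
var = n * (n + 1) * (2 * n + 1) / 24
# Correction for tied ranks: subtract sum(t^3 - t) / 48 over tie groups
_, counts = stats.find_repeats(ranks)
var -= (counts**3 - counts).sum() / 48

z = (T - mean) / np.sqrt(var)
p = 2 * stats.norm.sf(abs(z))
print(f'z = {z:.4f}, p = {p:.3f}')
```

This gives \(p = 0.594\), matching the asymptotic result above. (The tie correction slightly shifts \(z\) relative to the online calculator's 0.5331, which appears not to correct for ties, but the p-value rounds to the same 0.594.)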
This method='asymptotic' option is one of three provided by SciPy, along with method='exact' and method='auto'. In general, 'asymptotic' is better for large sample sizes while 'exact' is better for small sample sizes, but the choice is also affected by the number of ties (multiple pairs of data points with the same numerical difference) and zeros (pairs of data points with the same value). Fortunately, the third option, 'auto', automatically chooses the better of the two methods for your data, and it is the default.
In conclusion, it is sufficient to use SciPy’s implementation of the Wilcoxon signed-rank test, as in the two examples above. It uses a slightly different test statistic, and thus yields a slightly different p-value, than the method used on the Wikipedia page, but the difference is small and the method is still valid.