
Spearman’s rank correlation coefficient is the Pearson correlation between the ranks of your data’s x-values and the ranks of your data’s y-values.

The Greek letter ρ (rho) is used for the coefficient.

It can be calculated from data where the dependent variable (y-values) and independent variable (x-values) are both continuous, where a confounding factor exists (i.e. the independent variable is not truly independent) and where the parametric assumptions are not met.

Python Packages:

The code on this page uses the pandas, matplotlib, numpy, scipy and pingouin packages. These can be installed from the terminal with:

$ python3.11 -m pip install pandas
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install pingouin

where python3.11 corresponds to the version of Python you have installed and are using.

1 Example Data

Using the example from Wikipedia: people’s IQs and the number of hours of TV they watch per week:

import pandas as pd

dct = {
    'IQ': [106, 100, 86, 101, 99, 103, 97, 113, 112, 110],
    'Hours of TV': [7, 27, 2, 50, 28, 29, 20, 12, 6, 17],
}
df = pd.DataFrame(dct)

print(df)
##     IQ  Hours of TV
## 0  106            7
## 1  100           27
## 2   86            2
## 3  101           50
## 4   99           28
## 5  103           29
## 6   97           20
## 7  113           12
## 8  112            6
## 9  110           17

Presented visually:

import matplotlib.pyplot as plt

# Formatting options for plots
A = 6  # Want figures to be A6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
plt.rc('text', usetex=True)  # Use LaTeX (requires a working LaTeX installation)
plt.rc('font', family='serif')  # Use a serif font
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')  # Load Greek letters

# Create plot
ax = plt.axes()
x = df['IQ']
y = df['Hours of TV']
ax.scatter(x, y, c='k', s=20, alpha=0.6, marker='o')
# Labels
ax.set_title('Fake Data to Demonstrate Correlation')
ax.set_ylabel('Hours of TV Watched per Week')
ax.set_xlabel('IQ')
# Show plot
plt.show()

2 Manual Calculation

As is done on the Wikipedia page:

# Rank the values in the column (ties receive the average rank of the group)
df['rank x_i'] = df['IQ'].rank()
df['rank y_i'] = df['Hours of TV'].rank()
# Differences between ranks
df['d_i'] = df['rank x_i'] - df['rank y_i']
# Squared differences between ranks
df['d_i**2'] = df['d_i']**2

print(df)
##     IQ  Hours of TV  rank x_i  rank y_i  d_i  d_i**2
## 0  106            7       7.0       3.0  4.0    16.0
## 1  100           27       4.0       7.0 -3.0     9.0
## 2   86            2       1.0       1.0  0.0     0.0
## 3  101           50       5.0      10.0 -5.0    25.0
## 4   99           28       3.0       8.0 -5.0    25.0
## 5  103           29       6.0       9.0 -3.0     9.0
## 6   97           20       2.0       6.0 -4.0    16.0
## 7  113           12      10.0       4.0  6.0    36.0
## 8  112            6       9.0       2.0  7.0    49.0
## 9  110           17       8.0       5.0  3.0     9.0

The above table matches what is shown in the example. Now we can use the formula to calculate ρ:
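For reference, the formula in question is the usual shortcut form, which is exact only when there are no tied ranks (as is the case here). Writing d_i for the difference between the two ranks of observation i and n for the sample size:

```latex
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n \left( n^2 - 1 \right)}
```

With the table above, the squared differences sum to 194 and n = 10, so:

```latex
\rho = 1 - \frac{6 \times 194}{10 \times (10^2 - 1)} = 1 - \frac{1164}{990} \approx -0.1758
```

In code: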

# Sum of squared differences
sum_d2 = df['d_i**2'].sum()
# Sample size
n = len(df)
# Spearman's rank correlation coefficient, ρ
# (this shortcut formula is only exact when there are no tied ranks)
rho = 1 - (6 * sum_d2) / (n * (n**2 - 1))

print(f'ρ = {rho:.4f}')
## ρ = -0.1758

…and the associated p-value:

import numpy as np
import scipy.stats as st

# t-statistic
t = rho * np.sqrt((n - 2) / (1 - rho**2))
# Degrees of freedom
dof = n - 2
# Number of tails
tails = 2
# Significance (survival function of |t|, so it is valid for positive and negative ρ)
p = st.t.sf(abs(t), df=dof) * tails

print(f'p = {p:.4f}')
## p = 0.6272

These match the values in the example.
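
As a further cross-check, ρ can also be obtained as the Pearson correlation of the two sets of ranks. This is the more general definition and does not depend on the shortcut formula, so it should also hold when there are ties. The following is a minimal sketch that re-creates the example data so that it runs on its own; the names iq, tv, rank_x, rank_y and rho_from_ranks are chosen here purely for illustration:

```python
import numpy as np
import scipy.stats as st

# Example data (same values as in the data frame above)
iq = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
tv = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]

# Rank each variable (ties would receive the average rank of the group)
rank_x = st.rankdata(iq)
rank_y = st.rankdata(tv)

# Pearson correlation of the ranks
rho_from_ranks = np.corrcoef(rank_x, rank_y)[0, 1]

# Compare with SciPy's built-in Spearman function
rho_scipy, _ = st.spearmanr(iq, tv)

print(f'ρ = {rho_from_ranks:.4f}')
## ρ = -0.1758
```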

3 Using SciPy

Of course, the above can be done more quickly using packages:

import scipy.stats as st

rho, p = st.spearmanr(df['IQ'], df['Hours of TV'])

print(f'ρ = {rho:.4f}; p = {p:.4f}')
## ρ = -0.1758; p = 0.6272

4 Using Pandas

import pandas as pd

col = ['IQ', 'Hours of TV']

print(df[col].corr(method='spearman'))
##                    IQ  Hours of TV
## IQ           1.000000    -0.175758
## Hours of TV -0.175758     1.000000

There’s no way to get the p-value directly using this method (you can, of course, calculate it manually as shown above).
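
One possible workaround is to pass a callable as the method argument of corr(), so that each pair of columns is handed to a function of your choosing. The sketch below assumes your version of pandas supports this, and uses a hypothetical helper called spearman_p that returns SciPy's p-value instead of the coefficient. Note that pandas fills the diagonal with 1 regardless of what the callable returns, so only the off-diagonal entries are meaningful:

```python
import pandas as pd
import scipy.stats as st

# Example data (same values as in the data frame above)
df = pd.DataFrame({
    'IQ': [106, 100, 86, 101, 99, 103, 97, 113, 112, 110],
    'Hours of TV': [7, 27, 2, 50, 28, 29, 20, 12, 6, 17],
})


def spearman_p(a, b):
    """Hypothetical helper: return the p-value of Spearman's test for two arrays."""
    return float(st.spearmanr(a, b).pvalue)


# Matrix of p-values (ignore the diagonal, which pandas sets to 1)
p_values = df.corr(method=spearman_p)
p = p_values.loc['IQ', 'Hours of TV']

print(f'p = {p:.4f}')
## p = 0.6272
```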

5 Using Pingouin

import pingouin as pg

print(pg.corr(df['IQ'], df['Hours of TV'], method='spearman'))
##            n         r          CI95%     p-val    power
## spearman  10 -0.175758  [-0.73, 0.51]  0.627188  0.07705

This is the quickest way to get the confidence interval.
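
For reference, an interval like this can be approximated by hand using the Fisher z-transformation. The sketch below assumes that approach, with a standard error of 1/sqrt(n - 3), which is a common approximation for correlation coefficients. It is not presented as a description of Pingouin's internals, although for this example it gives the same interval once rounded:

```python
import numpy as np
import scipy.stats as st

# Example data (same values as in the data frame above)
iq = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
tv = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]

# Spearman's rank correlation coefficient and sample size
rho, _ = st.spearmanr(iq, tv)
n = len(iq)

# Fisher z-transformation of the coefficient
z = np.arctanh(rho)
# Approximate standard error on the z scale
se = 1 / np.sqrt(n - 3)
# Critical value for a two-sided 95% interval
z_crit = st.norm.ppf(0.975)

# Transform the interval limits back to the correlation scale
lower = np.tanh(z - z_crit * se)
upper = np.tanh(z + z_crit * se)

print(f'95% CI: [{lower:.2f}, {upper:.2f}]')
## 95% CI: [-0.73, 0.51]
```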
