Spearman’s rank correlation coefficient is the correlation between the rank of your data’s y-values and the rank of your data’s x-values.
The Greek letter ρ (rho) is used for the coefficient.
It can be calculated from data where the dependent variable (y-values) and independent variable (x-values) are both continuous, where a confounding factor exists (i.e. the independent variable is not truly independent) and where the parametric assumptions are not met.
Python Packages:
The code on this page uses the pandas, matplotlib, numpy, scipy and pingouin packages. These can be installed from the terminal with:
$ python3.11 -m pip install pandas
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install pingouin
where python3.11 corresponds to the version of Python you have installed and are using.
Using the example from Wikipedia, which compares people’s IQs with the number of hours of TV they watch per week:
import pandas as pd
dct = {
'IQ': [106, 100, 86, 101, 99, 103, 97, 113, 112, 110],
'Hours of TV': [7, 27, 2, 50, 28, 29, 20, 12, 6, 17],
}
df = pd.DataFrame(dct)
print(df)
## IQ Hours of TV
## 0 106 7
## 1 100 27
## 2 86 2
## 3 101 50
## 4 99 28
## 5 103 29
## 6 97 20
## 7 113 12
## 8 112 6
## 9 110 17
Presented visually:
import matplotlib.pyplot as plt
# Formatting options for plots
A = 6 # Want figures to be A6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
plt.rc('text', usetex=True) # Use LaTeX (requires a working LaTeX installation)
plt.rc('font', family='serif') # Use a serif font
plt.rc('text.latex', preamble=r'\usepackage{textgreek}') # Load Greek letters
# Create plot
ax = plt.axes()
x = df['IQ']
y = df['Hours of TV']
ax.scatter(x, y, c='k', s=20, alpha=0.6, marker='o')
# Labels
ax.set_title('Fake Data to Demonstrate Correlation')
ax.set_ylabel('Hours of TV Watched per Week')
ax.set_xlabel('IQ')
# Show plot
plt.show()
As is done on the Wikipedia page:
# Rank the values in the column (ties receive the average rank of the group)
df['rank x_i'] = df['IQ'].rank()
df['rank y_i'] = df['Hours of TV'].rank()
# Differences between ranks
df['d_i'] = df['rank x_i'] - df['rank y_i']
# Squared differences between ranks
df['d_i**2'] = df['d_i']**2
print(df)
## IQ Hours of TV rank x_i rank y_i d_i d_i**2
## 0 106 7 7.0 3.0 4.0 16.0
## 1 100 27 4.0 7.0 -3.0 9.0
## 2 86 2 1.0 1.0 0.0 0.0
## 3 101 50 5.0 10.0 -5.0 25.0
## 4 99 28 3.0 8.0 -5.0 25.0
## 5 103 29 6.0 9.0 -3.0 9.0
## 6 97 20 2.0 6.0 -4.0 16.0
## 7 113 12 10.0 4.0 6.0 36.0
## 8 112 6 9.0 2.0 7.0 49.0
## 9 110 17 8.0 5.0 3.0 9.0
The above table matches what is shown in the example. Now we can use the formula to calculate ρ:
# Sum of squared differences
sum_d2 = df['d_i**2'].sum()
# Sample size
n = len(df)
# Spearman's rank correlation coefficient, ρ (this formula assumes no tied ranks)
rho = 1 - (6 * sum_d2) / (n * (n**2 - 1))
print(f'ρ = {rho:.4f}')
## ρ = -0.1758
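As a cross-check on the definition given at the top of the page, ρ can also be obtained as the Pearson correlation of the two rank columns. The following is a minimal sketch that rebuilds the example data so it can be run on its own; the names `rank_x`, `rank_y` and `rho_from_ranks` are illustrative only. Unlike the shortcut formula, this route also works when the data contains tied ranks.

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    'IQ': [106, 100, 86, 101, 99, 103, 97, 113, 112, 110],
    'Hours of TV': [7, 27, 2, 50, 28, 29, 20, 12, 6, 17],
})

# Rank each column (ties, if any, receive the average rank)
rank_x = df['IQ'].rank()
rank_y = df['Hours of TV'].rank()

# Pearson correlation of the ranks is Spearman's ρ
rho_from_ranks = rank_x.corr(rank_y, method='pearson')

print(f'ρ = {rho_from_ranks:.4f}')
## ρ = -0.1758
```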
…and the associated p-value:
import numpy as np
import scipy.stats as st
# t-statistic
t = rho * np.sqrt((n - 2) / (1 - rho**2))
# Degrees of freedom
dof = n - 2
# Number of tails
tails = 2
# Significance (use |t| so the two-tailed p-value is also correct when ρ is positive)
p = st.t.sf(abs(t), df=dof) * tails
print(f'p = {p:.4f}')
## p = 0.6272
These match the values in the example.
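For reference, the two calculations above can be written out as equations with the example's numbers substituted in (n = 10, and the squared rank differences in the table sum to 194). This is only a restatement of the code, and the formula for ρ assumes there are no tied ranks:

```latex
\begin{align*}
\rho &= 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
      = 1 - \frac{6 \times 194}{10 \times 99}
      \approx -0.1758 \\
t &= \rho \sqrt{\frac{n - 2}{1 - \rho^2}}
   \approx -0.505 \\
p &= 2 \, P\left(T_{n-2} \geq |t|\right)
   \approx 0.6272
\end{align*}
```

Here T with subscript n − 2 denotes a Student's t-distributed variable with n − 2 = 8 degrees of freedom.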
Of course, the above can be done more quickly using packages such as SciPy, pandas and pingouin:
import scipy.stats as st
rho, p = st.spearmanr(df['IQ'], df['Hours of TV'])
print(f'ρ = {rho:.4f}; p = {p:.4f}')
## ρ = -0.1758; p = 0.6272
import pandas as pd
col = ['IQ', 'Hours of TV']
print(df[col].corr(method='spearman'))
## IQ Hours of TV
## IQ 1.000000 -0.175758
## Hours of TV -0.175758 1.000000
There’s no way to get the p-value directly using this method (you can, of course, calculate it manually as shown above).
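One possible workaround, sketched here rather than offered as a recommended approach, relies on the fact that `DataFrame.corr()` also accepts a callable as its `method` argument. Passing a small helper (the name `spearman_p` is made up for this example) that returns SciPy's p-value yields a matrix of p-values. Note that pandas always sets the diagonal of the result to 1, so only the off-diagonal entries are meaningful.

```python
import pandas as pd
import scipy.stats as st

# Same example data as above
df = pd.DataFrame({
    'IQ': [106, 100, 86, 101, 99, 103, 97, 113, 112, 110],
    'Hours of TV': [7, 27, 2, 50, 28, 29, 20, 12, 6, 17],
})


def spearman_p(a, b):
    """Return the two-sided p-value of Spearman's ρ for two 1-D arrays."""
    return st.spearmanr(a, b).pvalue


# Matrix of p-values (the diagonal is filled with 1 by pandas, not computed)
p_matrix = df.corr(method=spearman_p)

print(p_matrix.round(4))
```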
import pingouin as pg
print(pg.corr(df['IQ'], df['Hours of TV'], method='spearman'))
## n r CI95% p-val power
## spearman 10 -0.175758 [-0.73, 0.51] 0.627188 0.07705
This is the quickest way to get the confidence interval.
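To see where an interval like this can come from, the sketch below applies the Fisher z-transformation, a common approximation for correlation confidence intervals. The standard error of 1/√(n − 3) is an assumption made for this example; it reproduces the interval in the table above to two decimal places, but other choices of standard error exist for Spearman's ρ.

```python
import numpy as np
import scipy.stats as st

# Same example data as above
iq = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
tv = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]
n = len(iq)

# Spearman's ρ
rho = st.spearmanr(iq, tv).statistic if hasattr(st.spearmanr(iq, tv), 'statistic') \
    else st.spearmanr(iq, tv)[0]

# Fisher z-transformation of ρ
z = np.arctanh(rho)
# Approximate standard error on the z scale (assumed: 1 / sqrt(n - 3))
se = 1 / np.sqrt(n - 3)
# Critical value for a two-sided 95% interval
z_crit = st.norm.ppf(0.975)

# Transform the interval limits back to the correlation scale
ci_low, ci_high = np.tanh([z - z_crit * se, z + z_crit * se])

print(f'95% CI = [{ci_low:.2f}, {ci_high:.2f}]')
## 95% CI = [-0.73, 0.51]
```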