The Pearson correlation coefficient, r, can be calculated from data where the dependent and independent variables are both continuous, where a confounding factor exists (i.e. the independent variable is not truly independent) and where the parametric assumptions are met.
Python Packages:
The code on this page uses the pandas, matplotlib, scipy and numpy packages. These can be installed from the terminal with:
$ python3.11 -m pip install pandas
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install scipy
$ python3.11 -m pip install numpy
where python3.11 corresponds to the version of Python you have installed and are using.
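Once installed, a quick way to confirm that all four packages import correctly (and to see which versions you have) is:

```python
import pandas
import matplotlib
import scipy
import numpy

# Print each package's name and installed version to confirm that the
# installations worked
for pkg in (pandas, matplotlib, scipy, numpy):
    print(pkg.__name__, pkg.__version__)
```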
Let’s use the following example data set from Giavarina’s paper.1 Imagine that a set of objects were each measured twice - once using ‘Method A’ and once using ‘Method B’ - giving the two lists of measurements shown below:
import pandas as pd
dct = {
'Method A': [
1.0, 5.0, 10.0, 20.0, 50.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0,
150.0, 200.0, 250.0, 300.0, 350.0, 400.0, 450.0, 500.0, 550.0, 600.0,
650.0, 700.0, 750.0, 800.0, 850.0, 900.0, 950.0, 1000.0
],
'Method B': [
8.0, 16.0, 30.0, 24.0, 39.0, 54.0, 40.0, 68.0, 72.0, 62.0, 122.0, 80.0,
181.0, 259.0, 275.0, 380.0, 320.0, 434.0, 479.0, 587.0, 626.0, 648.0,
738.0, 766.0, 793.0, 851.0, 871.0, 957.0, 1001.0, 960.0
]
}
df = pd.DataFrame(dct)
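Before doing any statistics it's worth a quick sanity check that the two methods measured the same number of objects and that no values are missing (the data is repeated here so this snippet stands alone):

```python
import pandas as pd

dct = {
    'Method A': [
        1.0, 5.0, 10.0, 20.0, 50.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0,
        150.0, 200.0, 250.0, 300.0, 350.0, 400.0, 450.0, 500.0, 550.0, 600.0,
        650.0, 700.0, 750.0, 800.0, 850.0, 900.0, 950.0, 1000.0
    ],
    'Method B': [
        8.0, 16.0, 30.0, 24.0, 39.0, 54.0, 40.0, 68.0, 72.0, 62.0, 122.0, 80.0,
        181.0, 259.0, 275.0, 380.0, 320.0, 434.0, 479.0, 587.0, 626.0, 648.0,
        738.0, 766.0, 793.0, 851.0, 871.0, 957.0, 1001.0, 960.0
    ]
}
df = pd.DataFrame(dct)

# One row per measured object, one column per method
print(df.shape)
# Count the missing values; there should be none
print(df.isna().sum().sum())
```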
These can be visualised on a scatter plot:
import matplotlib.pyplot as plt
# Formatting options for plots
A = 5 # Want figures to be A5
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
plt.rc('text', usetex=True) # Use LaTeX
plt.rc('font', family='serif') # Use a serif font
plt.rc('text.latex', preamble=r'\usepackage{textgreek}') # Load Greek letters
# Create plot
ax = plt.axes()
ax.scatter(df['Method A'], df['Method B'], c='k', s=20, alpha=0.6, marker='o')
# Labels
ax.set_title('The Raw Data')
ax.set_xlabel('Method A')
ax.set_ylabel('Method B')
# Get axis limits
left, right = ax.get_xlim()
# Set axis limits
ax.set_xlim(0, right)
ax.set_ylim(0, right)
# Set aspect ratio
ax.set_aspect('equal')
# Show plot
plt.show()
These points seem to roughly show a straight-line relationship, but let’s be a bit more precise and start to quantify exactly how much they correlate…
As described on its Wikipedia page, the Pearson r statistic measures the linear correlation between two variables. It assumes that a set of data points can be approximated by a straight line and returns a value between −1 and +1 that describes how strong the linear correlation is: values close to ±1 indicate a strong positive or negative linear relationship, while values close to 0 indicate a weak one or none at all.
For more precise interpretations of the absolute value of r, the following can be used as a guideline2:
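The guideline can be encoded as a small helper function. Note that the cut-off values below are one commonly quoted version of Swinscow's bands, so treat the exact boundaries as an assumption rather than canonical:

```python
def interpret_r(r):
    """Qualitative strength of a correlation.

    The boundaries follow one common reading of Swinscow's guideline.
    """
    strength = abs(r)
    if strength < 0.2:
        return 'very weak'
    elif strength < 0.4:
        return 'weak'
    elif strength < 0.6:
        return 'moderate'
    elif strength < 0.8:
        return 'strong'
    else:
        return 'very strong'


print(interpret_r(0.996))  # 'very strong'
```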
…and here’s how to calculate it:
import scipy.stats as st
import numpy as np
# Pearson correlation coefficient, r
x = df['Method A']
y = df['Method B']
r, p = st.pearsonr(x, y)
# Transform r into a z-score via Fisher transformation
r_z = np.arctanh(r) # These two ways are equivalent
r_z = 0.5 * np.log((1 + r) / (1 - r)) # These two ways are equivalent
# Sample size
n = len(df)
# Standard error
se = 1 / np.sqrt(n - 3)
# Significance level, α
alpha = 0.05 # 95% confidence level
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Critical z-score, calculated using the percent-point function (aka the
# quantile function) of the normal distribution
z_crit = st.norm.ppf(q)
# Confidence interval in z-space
low_z, high_z = r_z - z_crit * se, r_z + z_crit * se
# Transform back into r-space
low, high = np.tanh((low_z, high_z))
print(f'r = {r:5.3f}, 95% CI: [{low:5.3f}, {high:5.3f}]; p = {p:.2e}')
## r = 0.996, 95% CI: [0.991, 0.998]; p = 1.26e-30
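If you have SciPy 1.9 or later, pearsonr itself can produce this confidence interval: it returns a result object with a confidence_interval() method that uses the same Fisher transformation. A quick cross-check against the manual calculation, on synthetic data so the snippet stands alone:

```python
import numpy as np
import scipy.stats as st

# Synthetic correlated data (a stand-in for Method A and Method B)
rng = np.random.default_rng(20230815)
x = rng.uniform(0, 1000, 30)
y = x + rng.normal(0, 50, 30)

# Manual Fisher-transformation interval, as above
res = st.pearsonr(x, y)
r_z = np.arctanh(res.statistic)
se = 1 / np.sqrt(len(x) - 3)
z_crit = st.norm.ppf(1 - 0.05 / 2)
low, high = np.tanh((r_z - z_crit * se, r_z + z_crit * se))

# SciPy's built-in interval (requires SciPy >= 1.9)
ci = res.confidence_interval(confidence_level=0.95)
print(np.isclose(low, ci.low), np.isclose(high, ci.high))
```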
This r-value is close to 1, which confirms that the points nearly follow a straight line. Let’s visualise this by using the r statistic to generate a line of best fit:
# Pearson correlation coefficient, r
r, p = st.pearsonr(x, y)
# Gradient of line of best fit
m = r * np.std(y) / np.std(x)
# Y-intercept of line of best-fit
c = np.mean(y) - m * np.mean(x)
# Create plot
ax = plt.axes()
ax.scatter(df['Method A'], df['Method B'], c='k', s=20, alpha=0.6, marker='o')
# Get axis limits
left, right = ax.get_xlim()
# Set axis limits
ax.set_xlim(0, right)
ax.set_ylim(0, right)
# Line of best fit
x = np.array([left, right])
y = m * x + c
label = rf'r = {r:5.3f}, 95\% CI: [{low:5.3f}, {high:5.3f}]; p = {p:.2e}'
ax.plot(x, y, c='grey', ls='--', label=label)
# Labels
ax.set_title('Pearson Correlation Analysis')
ax.set_xlabel('Method A')
ax.set_ylabel('Method B')
# Set aspect ratio
ax.set_aspect('equal')
# Legend
ax.legend(frameon=False)
# Show plot
plt.show()
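The slope formula used above, m = r × std(y) / std(x), is exactly the ordinary least-squares slope, since both equal cov(x, y) / var(x). As a sketch of this equivalence, here is a cross-check against np.polyfit on synthetic data (generated here so the snippet stands alone):

```python
import numpy as np

# Synthetic data with a roughly linear relationship plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 1000, 30)
y = 0.97 * x + rng.normal(0, 25, 30)

# Slope and intercept derived from the correlation coefficient
r = np.corrcoef(x, y)[0, 1]
m = r * np.std(y) / np.std(x)
c = np.mean(y) - m * np.mean(x)

# Slope and intercept from an ordinary least-squares fit of degree 1
m_ols, c_ols = np.polyfit(x, y, 1)

print(np.isclose(m, m_ols), np.isclose(c, c_ols))
```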
It’s often more useful to be able to do this calculation in a function so that it can be easily called as many times as you need it:
def pearson_correlation_coefficient(x, y):
    """Calculate Pearson's r with confidence interval and significance."""
    r, p = st.pearsonr(x, y)
    # Transform r into a z-score via the Fisher transformation
    r_z = np.arctanh(r)
    # Sample size
    n = len(x)
    # Standard error
    se = 1 / np.sqrt(n - 3)
    # Significance level, α
    alpha = 0.05  # 95% confidence level
    # Number of tails
    tails = 2
    # Quantile (the cumulative probability)
    q = 1 - (alpha / tails)
    # Critical z-score, calculated using the percent-point function (aka the
    # quantile function) of the normal distribution
    z_crit = st.norm.ppf(q)
    # Confidence interval in z-space
    low_z, high_z = r_z - z_crit * se, r_z + z_crit * se
    # Transform back into r-space
    low, high = np.tanh((low_z, high_z))
    return r, low, high, p
x = df['Method A']
y = df['Method B']
r, ci_lower, ci_upper, p = pearson_correlation_coefficient(x, y)
print(f'r = {r:5.3f}, 95% CI: [{ci_lower:5.3f}, {ci_upper:5.3f}]; p = {p:.2e}')
## r = 0.996, 95% CI: [0.991, 0.998]; p = 1.26e-30
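As a sketch of re-using the function, here it is applied to synthetic data with a known linear relationship (the function is repeated, minus comments, so the snippet stands alone; this copy takes its sample size from len(x) so that it does not depend on any global DataFrame):

```python
import numpy as np
import scipy.stats as st


def pearson_correlation_coefficient(x, y):
    """Calculate Pearson's r with confidence interval and significance."""
    r, p = st.pearsonr(x, y)
    r_z = np.arctanh(r)
    se = 1 / np.sqrt(len(x) - 3)
    z_crit = st.norm.ppf(1 - 0.05 / 2)
    low, high = np.tanh((r_z - z_crit * se, r_z + z_crit * se))
    return r, low, high, p


# Synthetic data: y is 0.8x plus Gaussian noise, so a strong positive
# correlation is expected
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 50)
y = 0.8 * x + rng.normal(0, 0.6, 50)
r, ci_lower, ci_upper, p = pearson_correlation_coefficient(x, y)
print(f'r = {r:5.3f}, 95% CI: [{ci_lower:5.3f}, {ci_upper:5.3f}]; p = {p:.2e}')
```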
Giavarina, D. “Understanding Bland Altman analysis”. Biochemia Medica 2015; 25(2):141-151. DOI: 10.11613/BM.2015.015. PMID: 26110027. ↩︎
Swinscow, TDV. “Study design and choosing a statistical test”. In: Campbell, MJ, editor. Statistics at Square One. BMJ Publishing Group; 1997. ↩︎