Passing–Bablok regression is a non-parametric method for comparing two ways of measuring the same quantity, typically in order to judge whether they are equivalent. It was first described in Passing & Bablok (1983)¹.
Essentially, when two different methods are used to take the same measurements, the values generated can be plotted on a scatter plot and a line-of-best-fit can be fitted. The Passing-Bablok method does this fitting by:

- calculating the gradient (‘slope’) of the straight line connecting every possible pair of points, discarding any pair whose connecting line has a gradient of exactly -1
- taking a ‘shifted median’ of these gradients to estimate the gradient of the line-of-best-fit
- estimating the y-intercept of the line-of-best-fit as the median of \(y_i - b x_i\) over all points

The median gradient and median y-intercept create the overall line-of-best-fit. Each of these has an associated confidence interval which can be interpreted as follows:

- if the confidence interval for the gradient contains 1 and the confidence interval for the y-intercept contains 0, there is no significant proportional or systematic difference between the two methods
- if the gradient’s interval excludes 1 there is a proportional difference, and if the y-intercept’s interval excludes 0 there is a systematic difference

A code sketch of this decision rule is given below.
See the Wikipedia article on Passing–Bablok regression for a longer overview.
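For example, here is a minimal sketch of the decision rule in Python, assuming the confidence bounds for the y-intercept (a_L and a_U) and for the gradient (b_L and b_U) have already been calculated (these are the variable names used throughout this page):

def interpret(a_L, a_U, b_L, b_U):
    """Sketch of how to interpret a Passing-Bablok confidence interval."""
    if not (b_L <= 1 <= b_U):
        print('Proportional difference: 1 lies outside the gradient CI')
    if not (a_L <= 0 <= a_U):
        print('Systematic difference: 0 lies outside the y-intercept CI')
    if (b_L <= 1 <= b_U) and (a_L <= 0 <= a_U):
        print('No significant difference; the methods can be considered equivalent')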
Let’s use the following hypothetical example dataset which comes from Giavarina (2015)². Imagine that a set of objects were each measured twice - once using ‘Method A’ and once using ‘Method B’ - giving the two lists of measurements below:
import pandas as pd
dct = {
    'Method A': [
        1, 5, 10, 20, 50, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350,
        400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000
    ],
    'Method B': [
        8, 16, 30, 24, 39, 54, 40, 68, 72, 62, 122, 80, 181, 259, 275, 380,
        320, 434, 479, 587, 626, 648, 738, 766, 793, 851, 871, 957, 1001, 960
    ]
}
df = pd.DataFrame(dct)
print(df.head())
## Method A Method B
## 0 1 8
## 1 5 16
## 2 10 30
## 3 20 24
## 4 50 39
First, let’s visualise this raw data:
import matplotlib.pyplot as plt
# Options
x = 5 # Want figures to be A5
plt.rc('figure', figsize=[46.82 * .5**(.5 * x), 33.11 * .5**(.5 * x)])
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')
# Plot
ax = plt.axes()
ax.set_title('The Raw Data')
ax.set_xlabel('Method A')
ax.set_ylabel('Method B')
# Scatter plot
ax.scatter(df['Method A'], df['Method B'], c='k', s=20, alpha=0.6, marker='o')
# Get axis limits
left, right = plt.xlim()
bottom, top = plt.ylim()
# Change axis limits
ax.set_xlim(0, right)
ax.set_ylim(0, top)
# Reference line
label = 'Reference line'
ax.plot([left, right], [left, right], c='grey', ls='--', label=label)
# Set aspect ratio
ax.set_aspect('equal')
# Legend
ax.legend(frameon=False)
# Show
plt.show()
Next, take each pair of points on this scatter plot and calculate the gradient (‘slope’, \(S\)) of the straight line that connects them.
import numpy as np
# Convert data to arrays
x = df['Method A'].values
y = df['Method B'].values
# Initialise a list of the gradients between each combination of two points
S = []
# Iterate over all combinations of two points
for i in range(len(x) - 1):
    for j in range(i + 1, len(x)):
        y_i, y_j, x_i, x_j = y[i], y[j], x[i], x[j]
        # Ignore identical points
        if (y_i == y_j) and (x_i == x_j):
            continue
        # Avoid division-by-zero errors
        if x_i == x_j:
            # Set the gradient to positive or negative infinity
            gradient = np.inf if y_i > y_j else -np.inf
            S.append(gradient)
            continue
        # Calculate the gradient between this pair of points
        gradient = (y_i - y_j) / (x_i - x_j)
        # Ignore any gradient equal to -1
        if gradient == -1:
            continue
        # Add the gradient to the list of gradients
        S.append(gradient)
Because we are considering every pair of points in our dataset and our dataset has a sample size of 30, the number of ways to connect two points is \(\binom{30}{2} = 435\). However, one pair of points - \(\left( 70, 72 \right)\) and \(\left( 80, 62 \right)\) - is connected by a line with a gradient of exactly -1, and Passing & Bablok tell us to exclude such combinations. The number of gradients we have calculated, \(N\), is thus 434:
# N is the number of elements in S (which might be fewer than the number
# of pairs of points, because some gradients are excluded)
N = len(S)
print(N)
## 434
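As a quick check of that arithmetic, the number of pairs can be computed directly using Python’s standard library:

from math import comb
# The number of ways to choose 2 points from a sample of 30
print(comb(30, 2))
## 435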
In preparation for taking the median value of the gradients, sort the list of gradients:
# Sort the list of gradients in preparation for taking the median
S.sort()
However, as Passing & Bablok point out, the values of these gradients are not independent and so their median would be a biased estimator of the gradient of the overall line-of-best-fit. Instead, we use an offset, \(K\), to calculate a shifted median, \(b\), which serves as the estimate of the overall gradient. This offset is defined as the number of gradients with a value less than -1:
# K is the number of gradients less than -1
S = np.array(S)
K = (S < -1).sum()
print(K)
## 5
…and the estimated overall gradient of the line-of-best-fit is determined by taking a shifted median of all the gradients:
# Calculate the shifted median
if N % 2 != 0:
    # If N is odd
    idx = (N + 1) / 2 + K
    # Convert to an integer and adjust for the fact that Python is 0-indexed
    idx = int(idx) - 1
    b = S[idx]
else:
    # If N is even
    idx = N / 2 + K
    # Convert to an integer and adjust for the fact that Python is 0-indexed
    idx = int(idx) - 1
    b = 0.5 * (S[idx] + S[idx + 1])
print(b)
## 1.055312195800306
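Equivalently - and this is an observation rather than part of Passing & Bablok’s description - the shifted median is simply the ordinary median of the sorted gradients after the smallest \(2K\) of them have been discarded, which gives a one-line sanity check:

# The shifted median equals the ordinary median of S with its 2K smallest
# values dropped
print(np.median(S[2 * K:]))
## 1.055312195800306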
Using this estimated gradient of the line-of-best-fit, we can plug in the raw data to get the estimated y-intercept of the line-of-best-fit, \(a\), as follows:
a = np.median(y - b * x)
print(a)
## 7.081855791962137
This gives us:
print(f'Line-of-best-fit: y = {a:4.2f} + {b:4.2f} x')
## Line-of-best-fit: y = 7.08 + 1.06 x
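As a quick sanity check (this is not in the source material), the definition of \(a\) means that the median residual of the fitted line should be zero, up to floating-point error:

# By construction, the median of the residuals is (numerically) zero
residuals = y - (a + b * x)
print(np.median(residuals))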
These values match the source we are using (Giavarina (2015)).
Usually we are interested in the 95% confidence interval, and so we could simply use the well-known fact that this interval corresponds to about 1.96 (or roughly 2) standard deviations either side of the mean. However, let’s calculate the critical value explicitly to make sure we are being accurate:
from scipy import stats as st
# Confidence level
C = 0.95 # 95%
# Significance level
gamma = 1 - C # 0.05
# Quantile (the cumulative probability; two-tailed)
q = 1 - (gamma / 2) # 0.975
# Critical z-score, calculated using the percent-point function (aka the
# quantile function) of the normal distribution
w = st.norm.ppf(q) # 1.96
print(w)
## 1.959963984540054
Passing & Bablok provide formulas for the indices of the sorted gradients that correspond to the bounds of the confidence interval:
# Sample size
n = len(x)
# Intermediate values
C_gamma = w * np.sqrt((n * (n - 1) * (2 * n + 5)) / 18)
M_1 = np.round((N - C_gamma) / 2)
M_2 = N - M_1 + 1
print(C_gamma, M_1, M_2)
## 109.8571032220829 162.0 273.0
# Get the lower and upper bounds of the confidence interval for the gradient
# (convert floats to ints and subtract 1 to adjust for Python being 0-indexed)
b_L = S[int(M_1) + K - 1]
b_U = S[int(M_2) + K - 1]
# Get the lower and upper bounds of the confidence interval for the y-intercept
a_L = np.median(y - b_U * x)
a_U = np.median(y - b_L * x)
We can check that these results indeed still match the source material (Giavarina (2015)):
print(f'Gradient = {b:4.2f} ({b_L:4.2f} to {b_U:4.2f})')
print(f'y-intercept = {a:4.2f} ({a_L:4.2f} to {a_U:4.2f})')
## Gradient = 1.06 (1.02 to 1.09)
## y-intercept = 7.08 (-0.30 to 19.84)
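Applying the decision rule sketched at the start of this page: the y-intercept’s confidence interval contains 0 but the gradient’s confidence interval does not contain 1, which suggests a proportional (but not a systematic) difference between the two methods:

# Interpret the confidence intervals
interpret(a_L, a_U, b_L, b_U)
## Proportional difference: 1 lies outside the gradient CI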
Re-plotting the raw data with the regression line plus confidence interval added in:
# Plot
ax = plt.axes()
ax.set_title('Passing-Bablok Regression')
ax.set_xlabel('Method A')
ax.set_ylabel('Method B')
# Scatter plot
ax.scatter(df['Method A'], df['Method B'], c='k', s=20, alpha=0.6, marker='o')
# Get axis limits
left, right = plt.xlim()
bottom, top = plt.ylim()
# Change axis limits
ax.set_xlim(0, right)
ax.set_ylim(0, top)
# Reference line
label = 'Reference line'
ax.plot([left, right], [left, right], c='grey', ls='--', label=label)
# Passing-Bablok regression line
x = np.array([left, right])
y = a + b * x
ax.plot(x, y, label=f'{b:4.2f}x + {a:4.2f}')
# Passing-Bablok regression line - confidence intervals
x = np.array([left, right])
y_L = a_L + b_L * x
y_U = a_U + b_U * x
label = f'Upper CI: {b_U:4.2f}x + {a_U:4.2f}'
ax.plot(x, y_U, c='tab:blue', alpha=0.2, label=label)
label = f'Lower CI: {b_L:4.2f}x + {a_L:4.2f}'
ax.plot(x, y_L, c='tab:blue', alpha=0.2, label=label)
ax.fill_between(x, y_U, y_L, alpha=0.2)
# Set aspect ratio
ax.set_aspect('equal')
# Legend
ax.legend(frameon=False)
# Show
plt.show()
As a second example, let’s use the following data from the Real Statistics page:
x = np.array([
    347, 249, 369, 286, 329, 410, 267, 295, 500, 286, 271, 506, 117, 329, 132,
    274, 277, 198
])
y = np.array([
    371, 283, 373, 341, 353, 454, 214, 230, 510, 295, 286, 453, 114, 328, 109,
    203, 305, 154
])
We’ll do the same process but wrapped into a function:
def passing_bablok(x, y):
    """Perform Passing-Bablok analysis."""
    # Initialise a list of the gradients between each combination of two points
    S = []
    # Iterate over all combinations of two points
    for i in range(len(x) - 1):
        for j in range(i + 1, len(x)):
            y_i, y_j, x_i, x_j = y[i], y[j], x[i], x[j]
            # Ignore identical points
            if (y_i == y_j) and (x_i == x_j):
                continue
            # Avoid division-by-zero errors
            if x_i == x_j:
                # Set the gradient to positive or negative infinity
                gradient = np.inf if y_i > y_j else -np.inf
                S.append(gradient)
                continue
            # Calculate the gradient between this pair of points
            gradient = (y_i - y_j) / (x_i - x_j)
            # Ignore any gradient equal to -1
            if gradient == -1:
                continue
            # Add the gradient to the list of gradients
            S.append(gradient)
    # N is the number of elements in S (which might be fewer than the number
    # of pairs of points, because some gradients are excluded)
    N = len(S)
    # Sort the list of gradients in preparation for taking the median
    S.sort()
    # K is the number of gradients less than -1
    S = np.array(S)
    K = (S < -1).sum()
    # Calculate the shifted median
    if N % 2 != 0:
        # If N is odd
        idx = (N + 1) / 2 + K
        # Convert to an integer and adjust for Python being 0-indexed
        idx = int(idx) - 1
        b = S[idx]
    else:
        # If N is even
        idx = N / 2 + K
        # Convert to an integer and adjust for Python being 0-indexed
        idx = int(idx) - 1
        b = 0.5 * (S[idx] + S[idx + 1])
    # Estimate the y-intercept as the median residual
    a = np.median(y - b * x)
    # Confidence level
    C = 0.95  # 95%
    # Significance level
    gamma = 1 - C  # 0.05
    # Quantile (the cumulative probability; two-tailed)
    q = 1 - (gamma / 2)  # 0.975
    # Critical z-score, calculated using the percent-point function (aka the
    # quantile function) of the normal distribution
    w = st.norm.ppf(q)  # 1.96
    # Sample size
    n = len(x)
    # Intermediate values
    C_gamma = w * np.sqrt((n * (n - 1) * (2 * n + 5)) / 18)
    M_1 = np.round((N - C_gamma) / 2)
    M_2 = N - M_1 + 1
    # Get lower and upper bounds of the confidence interval for the gradient
    # (convert floats to ints and subtract 1 to adjust for 0-indexing)
    b_L = S[int(M_1) + K - 1]
    b_U = S[int(M_2) + K - 1]
    # Get lower and upper bounds of the confidence interval for the y-intercept
    a_L = np.median(y - b_U * x)
    a_U = np.median(y - b_L * x)
    return a, a_L, a_U, b, b_L, b_U
a, a_L, a_U, b, b_L, b_U = passing_bablok(x, y)
print(f'Gradient = {b:6.4f} ({b_L:6.4f} to {b_U:6.4f})')
## Gradient = 1.1274 (0.9198 to 1.4564)
print(f'y-intercept = {a:6.4f} ({a_L:6.4f} to {a_U:6.4f})')
## y-intercept = -33.6179 (-134.3624 to 32.7701)
These values precisely match those from the original page.
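As a final consistency check, running the function on the first dataset reproduces the values that were calculated step-by-step above:

# Re-run the analysis on the Giavarina (2015) data using the function
a, a_L, a_U, b, b_L, b_U = passing_bablok(df['Method A'].values, df['Method B'].values)
print(f'Gradient = {b:4.2f} ({b_L:4.2f} to {b_U:4.2f})')
print(f'y-intercept = {a:4.2f} ({a_L:4.2f} to {a_U:4.2f})')
## Gradient = 1.06 (1.02 to 1.09)
## y-intercept = 7.08 (-0.30 to 19.84)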
1. Passing H, Bablok W. A new biometrical procedure for testing the equality of measurements from two different analytical methods. Application of linear regression procedures for method comparison studies in Clinical Chemistry, Part I. Journal of Clinical Chemistry and Clinical Biochemistry. 1983;21(11):709–20. DOI: 10.1515/cclm.1983.21.11.709. PMID: 6655447.
2. Giavarina D. Understanding Bland Altman analysis. Biochemia Medica. 2015;25(2):141–151. DOI: 10.11613/BM.2015.015. PMID: 26110027.