Bland-Altman analysis is used to assess the agreement between two methods of measuring something, usually clinical information. It was introduced in Bland & Altman’s 1986 paper¹ (see also the Wikipedia page).
Calculating agreement is useful when, for example, deciding whether two measurement methods or devices can be used interchangeably.
The example shown in Bland & Altman’s paper uses data created especially for the paper. Bland measured the maximum speed of expiration (peak expiratory flow rate, or ‘PEFR’) in litres per minute of 17 people - mainly his family and friends - using two different devices: a large and a mini Wright peak flow meter. This data is shown below:
import pandas as pd
dct = {
    'Wright Mini (L/min)': [
        512, 430, 520, 428, 500, 600, 364, 380, 658,
        445, 432, 626, 260, 477, 259, 350, 451
    ],
    'Wright Large (L/min)': [
        494, 395, 516, 434, 476, 557, 413, 442, 650,
        433, 417, 656, 267, 478, 178, 423, 427
    ]
}
df = pd.DataFrame(dct)
Here are the measurements taken by the two different devices plotted against one another (this reproduces Figure 1 of Bland & Altman (1986)):
import matplotlib.pyplot as plt
# Formatting options for plots
A = 5 # Want figures to be A5
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
plt.rc('text', usetex=True) # Use LaTeX
plt.rc('font', family='serif') # Use a serif font
plt.rc('text.latex', preamble=r'\usepackage{textgreek}') # Load Greek letters
# Create plot
ax = plt.axes()
x = df['Wright Large (L/min)']
y = df['Wright Mini (L/min)']
ax.scatter(x, y, c='k', s=20, alpha=0.6, marker='o')
# Labels
ax.set_title('Peak Expiratory Flow Rate')
ax.set_xlabel('Large Meter (L/min)')
ax.set_ylabel('Mini Meter (L/min)')
# Get axis limits
left, right = plt.xlim()
# Set axis limits
ax.set_xlim(0, right)
ax.set_ylim(0, right)
# Reference line
ax.plot([0, right], [0, right], c='grey', ls='--', label='Line of Equality')
# Set aspect ratio
ax.set_aspect('equal')
# Legend
ax.legend(frameon=False)
# Show plot
plt.show()
As Bland and Altman say in their paper, at this point researchers will often calculate the Pearson correlation coefficient (Pearson’s r) between the two methods. Doing so will give \(r = 0.94\) with \(p \lt 0.001\):
import scipy.stats as st
r, p = st.pearsonr(x, y)
print(f'r = {r:.2f} (p = {p:.2e})')
## r = 0.94 (p = 1.40e-08)
However, they continue: “The null hypothesis here is that the measurements by the two methods are not linearly related. The probability is very small and we can safely conclude that PEFR measurements by the mini and large meters are related. However, this high correlation does not mean that the two methods agree”. To paraphrase the points they make:
- Correlation measures the strength of a relationship between two variables, not the agreement between them
- There is perfect agreement only if the points lie along the line of equality, but there is perfect correlation if the points lie along any straight line
- A change in the scale of measurement does not affect the correlation, but it certainly affects the agreement
- Correlation depends on the range of the measured quantity in the sample: the wider the range, the higher the correlation
- Data which seem to be in poor agreement can produce quite high correlations
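To see the third of these points in action, here is a quick sketch (using the x and y arrays defined above): doubling every mini-meter reading changes the scale of measurement, which leaves the correlation untouched but destroys the agreement:
import numpy as np
# Rescaling one method leaves the correlation unchanged...
r_scaled, _ = st.pearsonr(x, 2 * y)
print(f'r = {r_scaled:.2f}')
## r = 0.94
# ...but the two methods now disagree by hundreds of litres per minute
print(f'Mean difference: {np.mean(x - 2 * y):.2f} L/min')
## Mean difference: -454.59 L/min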
All the above leads us towards looking for a new method of measuring agreement…
In essence, if we are interested in knowing how much two measurement methods differ, we should calculate the average difference between the values they produce when measuring the same participant. If this is small then it is effectively irrelevant which method is used; both will yield a similar result, so the two methods or devices can be used interchangeably (or one can be used instead of the other if it is preferable for whatever reason, e.g. if it is less expensive). To check that the measurement differences are not related to the actual values being measured, it is useful to plot them against the means of the pairs of values (ideally you would plot them against the true values, but as these are unknown the means of the measured pairs are the best estimates available).
As such, we start by calculating the means and differences of the pairs of measurements. This is the data that will be plotted:
means = (x + y) / 2
diffs = x - y
Then find the average difference and the standard deviation of the differences. These are good indicators of whether or not a method is biased and how consistent this bias is, respectively:
import numpy as np
# Average difference (aka the bias)
bias = np.mean(diffs)
# Sample standard deviation
s = np.std(diffs, ddof=1) # Use ddof=1 to get the sample standard deviation
print(f'For the differences, μ = {bias:.2f} L/min and s = {s:.2f} L/min')
## For the differences, μ = -2.12 L/min and s = 38.77 L/min
If we assume that the differences are normally distributed*, then approximately 95% of them lie within two standard deviations of the mean difference (use 1.96 standard deviations if you want to be more accurate). The endpoints of this 95% range are known as the “limits of agreement” (LOAs):
# Limits of agreement (LOAs)
upper_loa = bias + 1.96 * s
lower_loa = bias - 1.96 * s
print(f'The limits of agreement are {upper_loa:.2f} L/min and {lower_loa:.2f} L/min')
## The limits of agreement are 73.86 L/min and -78.10 L/min
*Assuming a normal distribution for the differences is reasonable here because, even if the raw measurements themselves are not normally distributed, by taking the difference between pairs of measurements we remove a lot of the variation (‘noise’) and probably end up with mostly measurement error.
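As an aside, this assumption can be checked. Here is a sketch of one way to do it (an added check, not part of the original paper): a Shapiro-Wilk test, whose null hypothesis is that the data are normally distributed:
# Shapiro-Wilk test for normality on the differences (an added check,
# not part of Bland & Altman's original analysis)
stat, p_value = st.shapiro(diffs)
print(f'W = {stat:.3f}, p = {p_value:.3f}')
A large p-value here would give no reason to reject the assumption of normality.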
The z-score (the number of standard deviations away from the mean) that corresponds to the endpoints of an interval containing the central 95% of the standard normal distribution is approximately 1.96. This value is used in the calculation above and yields LOAs of 73.86 L/min and -78.10 L/min. Bland and Altman used a rounded value of 2 in their paper (and also used a rounded value of 38.8 for their sample standard deviation) and thus got the slightly less accurate answers of 75.5 L/min and -79.7 L/min. For maximum precision, we can calculate the z-score for a 95% confidence level directly:
# Confidence level
C = 0.95 # 95%
# Significance level, α
alpha = 1 - C
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Critical z-score, calculated using the percent-point function (aka the
# quantile function) of the normal distribution
z_star = st.norm.ppf(q)
print(f'95% of normally distributed data lies within {z_star}σ of the mean')
## 95% of normally distributed data lies within 1.959963984540054σ of the mean
The above yields LOAs of:
# Limits of agreement (LOAs)
loas = (bias - z_star * s, bias + z_star * s)
print(f'The limits of agreement are {loas} L/min')
## The limits of agreement are (-78.09590546711173, 73.86061134946466) L/min
A more convenient function that gets to this answer more quickly is norm.interval(), as shown below:
# Limits of agreement (LOAs)
loas = st.norm.interval(C, bias, s)
print(np.round(loas, 2))
## [-78.1 73.86]
Now we can plot this data, reproducing Figure 2 of Bland & Altman (1986):
# Create plot
ax = plt.axes()
ax.scatter(means, diffs, c='k', s=20, alpha=0.6, marker='o')
# Plot the zero line
ax.axhline(y=0, c='k', lw=0.5)
# Plot the bias and the limits of agreement
ax.axhline(y=loas[1], c='grey', ls='--')
ax.axhline(y=bias, c='grey', ls='--')
ax.axhline(y=loas[0], c='grey', ls='--')
# Labels
ax.set_title('Bland-Altman Plot for Two Methods of Measuring PEFR')
ax.set_xlabel('Mean (L/min)')
ax.set_ylabel('Difference (L/min)')
# Get axis limits
left, right = ax.get_xlim()
bottom, top = ax.get_ylim()
# Set y-axis limits
max_y = max(abs(bottom), abs(top))
ax.set_ylim(-max_y * 1.1, max_y * 1.1)
# Set x-axis limits
domain = right - left
ax.set_xlim(left, left + domain * 1.1)
# Annotations
ax.annotate('+LOA', (right, upper_loa), (0, 7), textcoords='offset pixels')
ax.annotate(f'{upper_loa:+4.2f}', (right, upper_loa), (0, -25), textcoords='offset pixels')
ax.annotate('Bias', (right, bias), (0, 7), textcoords='offset pixels')
ax.annotate(f'{bias:+4.2f}', (right, bias), (0, -25), textcoords='offset pixels')
ax.annotate('-LOA', (right, lower_loa), (0, 7), textcoords='offset pixels')
ax.annotate(f'{lower_loa:+4.2f}', (right, lower_loa), (0, -25), textcoords='offset pixels')
# Show plot
plt.show()
Neither the bias line nor the limits of agreement are known with certainty; they each have a ‘standard error’ associated with them. The standard error, \(se\), of the bias is:
\[{se}_{bias} = \sqrt{\frac{s^2}{n}}\]
where \(s\) is the sample standard deviation and \(n\) is the sample size. The standard error of the limits of agreement is approximately:
\[{se}_{LOA} = \sqrt{\frac{3s^2}{n}}\]
These standard errors can be converted into confidence intervals and, specifically, we will use 95% confidence intervals. These are ranges of numbers in which we can say with 95% confidence that the true values of the bias and of the LOAs lie. If we use the Student’s t-distribution* to do this then the formula for the endpoints of the confidence interval will be:
\[y \pm \left(t^* \times {se} \right)\]
where \(y\) is the horizontal line in question (the bias, upper LOA or lower LOA) and \(t^*\) is the critical t-statistic for this confidence level and this sample size. The derivation of this formula is on this page.
*The t-distribution is appropriate when working with a relatively small sample size (n < 30) and/or when the population standard deviation isn’t known, both of which are true in this case.
In Python, these calculations look as follows:
# Sample size
n = len(df)
# Degrees of freedom
dof = n - 1
# Standard error of the bias
se_bias = s / np.sqrt(n)
# Standard error of the LOAs
se_loas = np.sqrt(3 * s**2 / n)
# Confidence interval for the bias
ci_bias = st.t.interval(C, dof, bias, se_bias)
# Confidence interval for the lower LOA
ci_lower_loa = st.t.interval(C, dof, loas[0], se_loas)
# Confidence interval for the upper LOA
ci_upper_loa = st.t.interval(C, dof, loas[1], se_loas)
print(
    f' Lower LOA = {np.round(lower_loa, 2)}, 95% CI {np.round(ci_lower_loa, 2)}\n',
    f'Bias = {np.round(bias, 2)}, 95% CI {np.round(ci_bias, 2)}\n',
    f'Upper LOA = {np.round(upper_loa, 2)}, 95% CI {np.round(ci_upper_loa, 2)}'
)
## Lower LOA = -78.1, 95% CI [-112.62 -43.57]
## Bias = -2.12, 95% CI [-22.05 17.81]
## Upper LOA = 73.86, 95% CI [ 39.34 108.38]
These confidence intervals can be drawn onto the Bland-Altman plot by adding the following lines to the plotting code above (before plt.show()); they appear as capped vertical lines on the left-hand side of the axes:
# Confidence intervals
ax.plot([left] * 2, list(ci_upper_loa), c='grey', ls='--', alpha=0.5)
ax.plot([left] * 2, list(ci_bias), c='grey', ls='--', alpha=0.5)
ax.plot([left] * 2, list(ci_lower_loa), c='grey', ls='--', alpha=0.5)
# Confidence intervals' caps
x_range = [left - domain * 0.025, left + domain * 0.025]
ax.plot(x_range, [ci_upper_loa[1]] * 2, c='grey', ls='--', alpha=0.5)
ax.plot(x_range, [ci_upper_loa[0]] * 2, c='grey', ls='--', alpha=0.5)
ax.plot(x_range, [ci_bias[1]] * 2, c='grey', ls='--', alpha=0.5)
ax.plot(x_range, [ci_bias[0]] * 2, c='grey', ls='--', alpha=0.5)
ax.plot(x_range, [ci_lower_loa[1]] * 2, c='grey', ls='--', alpha=0.5)
ax.plot(x_range, [ci_lower_loa[0]] * 2, c='grey', ls='--', alpha=0.5)
In Python, the t-statistic can be calculated for a given confidence level and a given sample size by using SciPy’s percent-point function (PPF) for the t-distribution:
import scipy.stats as st
# Sample size
n = len(df)
# Degrees of freedom
dof = n - 1
# Confidence level
C = 0.95 # 95%
# Significance level, α
alpha = 1 - C
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
t_star = st.t.ppf(q, dof)
print(f't* = {t_star:.2f}')
## t* = 2.12
This value of 2.12 is used in the Bland & Altman paper.
Be careful not to assign the t-statistic to the variable t if you have imported SciPy’s t-distribution sub-module as t (via from scipy.stats import t) as it will overwrite it. We have avoided this in the above snippet by using import scipy.stats as st.
Usually, it’s simpler to calculate the t-statistic as an interval:
# Sample size
n = len(df)
# Degrees of freedom
dof = n - 1
# Confidence level
C = 0.95 # 95%
# Endpoints of the range that contains 95% of Student's t-distribution
ci = st.t.interval(C, dof)
print(np.round(ci, 2))
## [-2.12 2.12]
The t-statistic that corresponds to a 95% confidence interval given \(n - 1\) degrees of freedom can then be used to get the confidence intervals on the bias and LOA lines, as has been done above.
Using a t-statistic with the t-distribution is analogous to using a z-score with the normal distribution: both are the number of standard deviations (or standard errors) away from the mean that corresponds to a given confidence level.
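For example, re-using the quantile q from above, we can compare the critical values that the two distributions give for a 95% confidence level and this sample size:
# Critical z-score (this does not depend on the sample size)
z_star = st.norm.ppf(q)
# Critical t-statistic for n - 1 = 16 degrees of freedom
t_star = st.t.ppf(q, 16)
print(f'z* = {z_star:.2f}, t* = {t_star:.2f}')
## z* = 1.96, t* = 2.12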
The widths of the 95% confidence intervals depend on two factors: the sample standard deviation and the sample size. When designing an experiment you cannot control what the standard deviation of your data will be, but you can control the number of samples you test. Hence, it’s useful to know how this number will affect the confidence you will have in your result.
As mentioned above, the confidence intervals are calculated as:
\[y \pm \left(t^* \times {se} \right)\]
where \(y\) is the horizontal line you are interested in - the bias, upper LOA or lower LOA - and \(se\) is its standard error. \(t^*\) is the critical t-statistic that corresponds to 95% confidence for the sample size in question. This equation implies that the half-widths of the confidence intervals are:
\[t^* \times {se}\]
which means that the full-widths of the confidence intervals for the bias and LOA lines, respectively, are twice these:
\[CI_{bias} = 2 \times t^* \times \sqrt{\dfrac{ {s}^2}{n}}\]
\[CI_{LOAs} = 2 \times t^* \times \sqrt{\dfrac{3 {s}^2}{n}}\]
so the widths of the confidence intervals relative to the sample standard deviations are:
\[\dfrac{CI_{bias}}{s} = 2 \times t^* \times \sqrt{\dfrac{1}{n}}\]
\[\dfrac{CI_{LOAs}}{s} = 2 \times t^* \times \sqrt{\dfrac{3}{n}}\]
In Python, these are calculated as:
# Confidence level
C = 0.95 # 95%
# Significance level, α
alpha = 1 - C
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
t_star = st.t.ppf(q, dof)
ci_width_bias = 2 * t_star * np.sqrt(1 / n)
ci_width_loas = 2 * t_star * np.sqrt(3 / n)
and they can be plotted as follows:
# Create plot
ax = plt.axes()
# Scatter plots
n = np.arange(5, 51)
dof = n - 1
t_star = st.t.ppf(q, dof)
ci_width_loas = 2 * t_star * np.sqrt(3 / n)
ax.scatter(n, ci_width_loas, c='k', s=20, alpha=0.6, marker='o')
ci_width_bias = 2 * t_star * np.sqrt(1 / n)
ax.scatter(n, ci_width_bias, c='k', s=20, alpha=0.6, marker='o')
# Smooth curves
n = np.arange(5, 50, 0.1)
dof = n - 1
t_star = st.t.ppf(q, dof)
ci_width_loas = 2 * t_star * np.sqrt(3 / n)
label = r'LOAs $\left(2 \times t^* \times \sqrt{\frac{3}{n}}\right)$'
ax.plot(n, ci_width_loas, label=label)
ci_width_bias = 2 * t_star * np.sqrt(1 / n)
label = r'Bias $\left(2 \times t^* \times \sqrt{\frac{1}{n}}\right)$'
ax.plot(n, ci_width_bias, c='tab:blue', alpha=0.2, label=label)
# Labels
ax.set_title('The widths of the confidence intervals (relative to the\nsample standard deviation) decrease if there are more samples')
ax.set_xlabel('Sample Size (n)')
ax.set_ylabel('Width of Confidence Intervals\n(Relative to Sample Standard Deviation)')
# Set y-axis limits
bottom, top = ax.get_ylim()
ax.set_ylim(0, top)
# Set x-axis limits
ax.set_xlim(5, 50)
# Legend
ax.legend(frameon=False)
# Show plot
plt.show()
As an example, if you decrease your sample size from 40 to 30, the widths of the confidence intervals of your limits of agreement will increase by 16.8%:
# Sample size
n = 40
# Degrees of freedom
dof = n - 1
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
t_star = st.t.ppf(q, dof)
# Confidence interval width
width_40 = 2 * t_star * np.sqrt(3 / n)
# Sample size
n = 30
# Degrees of freedom
dof = n - 1
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
t_star = st.t.ppf(q, dof)
width_30 = 2 * t_star * np.sqrt(3 / n)
increase = (width_30 - width_40) / width_40 * 100
print(
    f"Your confidence intervals' widths have gone from {width_40:4.2f} standard deviations to",
    f"{width_30:4.2f} standard deviations, an increase of {increase:4.1f}%"
)
## Your confidence intervals' widths have gone from 1.11 standard deviations to 1.29 standard deviations, an increase of 16.8%
When doing multiple Bland-Altman calculations it’s often useful to wrap the steps up in a function; multiple data sets can then be analysed in quick succession. Here’s an example of such a function:
def bland_altman_analysis(df):
    """
    Calculate agreement statistics.

    Within-subject SD and repeatability coefficient are defined in Shukla-Dave
    (2019), available here:
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6526078/
    """
    tests = list(df)

    # Individual subject calculations
    df['Means'] = df[tests].mean(axis=1)
    # Differences (the second measurement minus the first)
    df['Differences'] = df[tests].diff(axis=1)[tests[-1]]
    df['s'] = df[tests].std(axis=1, ddof=1)
    df['var'] = df['s']**2

    # Whole sample calculations
    summary = pd.DataFrame()
    means = ['Mean of ' + test for test in tests]
    for i, mean in enumerate(means):
        summary.loc[1, mean] = df[tests[i]].mean()
    # Sample size
    summary.loc[1, 'Sample Size (n)'] = df.shape[0]
    # Degrees of freedom
    summary.loc[1, 'DOF'] = df.shape[0] - 1
    # Bias (mean difference)
    mean_diff = df['Differences'].mean()
    summary.loc[1, 'Bias'] = mean_diff
    # Sample standard deviation of the differences
    s_diff = df['Differences'].std(ddof=1)
    summary.loc[1, 'Sample SD (s)'] = s_diff
    # Limits of agreement (LOAs)
    summary.loc[1, 'Lower LOA'] = mean_diff - 1.96 * s_diff
    summary.loc[1, 'Upper LOA'] = mean_diff + 1.96 * s_diff
    # Within-subject standard deviation
    s_w = np.sqrt(df['var'].mean())
    summary.loc[1, 'Within-Subject SD (Sw)'] = s_w
    # Coefficient of repeatability
    col = 'Repeatability Coefficient (RC)'
    summary.loc[1, col] = np.sqrt(2) * 1.96 * s_w

    return df, summary
…and here is how it can be used:
From the R Documentation: “These contrived repeated-measures data are taken from O’Brien and Kaiser (1985). The data are from an imaginary study in which 16 female and male subjects, who are divided into three treatments, are measured at a pretest, postest, and a follow-up session; during each session, they are measured at five occasions at intervals of one hour. The design, therefore, has two between-subject and two within-subject factors.”
from pydataset import data
OBrienKaiser = data('OBrienKaiser')
# Take a copy so that the function can add columns without mutating a slice
df = OBrienKaiser[['pre.3', 'pre.4']].copy()
df, summary = bland_altman_analysis(df)
print(summary)
## Mean of pre.3 Mean of pre.4 Sample Size (n) DOF Bias Sample SD (s) \
## 1 5.3125 4.5 16.0 15.0 -0.8125 1.833712
##
## Lower LOA Upper LOA Within-Subject SD (Sw) \
## 1 -4.406576 2.781576 1.38067
##
## Repeatability Coefficient (RC)
## 1 3.827022
This example comes from here:
np.random.seed(9999)
m1 = np.random.random(20)
m2 = np.random.random(20)
df = pd.DataFrame({
    'pre.1': m2,
    'pre.2': m1
})
df, summary = bland_altman_analysis(df)
print(summary)
## Mean of pre.1 Mean of pre.2 Sample Size (n) DOF Bias \
## 1 0.563223 0.427121 20.0 19.0 -0.136102
##
## Sample SD (s) Lower LOA Upper LOA Within-Subject SD (Sw) \
## 1 0.352535 -0.827071 0.554866 0.261334
##
## Repeatability Coefficient (RC)
## 1 0.72438
These examples come from the same Bland & Altman (1986) paper¹:
# Raw data
dct = {
    'First Measurement': [
        494, 395, 516, 434, 476, 557, 413, 442, 650, 433, 417, 656, 267, 478, 178, 423, 427
    ],
    'Second Measurement': [
        490, 397, 512, 401, 470, 611, 415, 431, 638, 429, 420, 633, 275, 492, 165, 372, 421
    ],
}
wright_large = pd.DataFrame(dct)
# Bland-Altman analysis
df, summary = bland_altman_analysis(wright_large)
print(summary)
## Mean of First Measurement Mean of Second Measurement Sample Size (n) \
## 1 450.352941 445.411765 17.0
##
## DOF Bias Sample SD (s) Lower LOA Upper LOA \
## 1 16.0 -4.941176 21.724038 -47.520291 37.637938
##
## Within-Subject SD (Sw) Repeatability Coefficient (RC)
## 1 15.306669 42.427922
# Raw data
dct = {
    'First Measurement': [
        512, 430, 520, 428, 500, 600, 364, 380, 658, 445, 432, 626, 260, 477, 259, 350, 451
    ],
    'Second Measurement': [
        525, 415, 508, 444, 500, 625, 460, 390, 642, 432, 420, 605, 227, 467, 268, 370, 443
    ],
}
wright_mini = pd.DataFrame(dct)
# Bland-Altman analysis
df, summary = bland_altman_analysis(wright_mini)
print(summary)
## Mean of First Measurement Mean of Second Measurement Sample Size (n) \
## 1 452.470588 455.352941 17.0
##
## DOF Bias Sample SD (s) Lower LOA Upper LOA \
## 1 16.0 2.882353 28.87231 -53.707375 59.472081
##
## Within-Subject SD (Sw) Repeatability Coefficient (RC)
## 1 19.910831 55.190007
Other authors have expanded upon Bland and Altman’s ideas:
Giavarina analysis is identical to Bland-Altman analysis except that it accounts for heteroscedasticity: it uses percentage differences (relative to the means) on the y-axis instead of arithmetic differences. It was published in Giavarina’s 2015 paper² and a tutorial that replicates elements of that paper is over here.
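As a minimal sketch of the core idea (re-using the x, y and means values from above; this is only the central transformation, not a full replication of Giavarina’s method):
# Percentage differences (relative to the means) replace the raw differences
percent_diffs = (x - y) / means * 100
# The bias and LOAs are then calculated as before, but in percentage terms
bias_pct = np.mean(percent_diffs)
s_pct = np.std(percent_diffs, ddof=1)
loas_pct = st.norm.interval(0.95, bias_pct, s_pct)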
Euser analysis also accounts for heteroscedasticity, but does so via a logarithmic transformation. While this approach was mentioned in Bland & Altman (1986), it was Euser, Dekker & Le Cessie who published a method to transform the data back into the native space and calculate a meaningful coefficient of variation.³ A tutorial is over here.
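Similarly, here is a rough sketch of the log-transform idea (again not Euser’s full procedure): the analysis is performed on logged values and the results are exponentiated, which turns the limits of agreement into multiplicative (ratio) limits:
# Work with the differences of the logged measurements
log_diffs = np.log(x) - np.log(y)
bias_log = np.mean(log_diffs)
s_log = np.std(log_diffs, ddof=1)
# Back-transform into the native space: the LOAs become ratios
ratio_loas = np.exp(st.norm.interval(0.95, bias_log, s_log))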
1. Bland, M, Altman, D. “Statistical methods for assessing agreement between two methods of clinical measurement”. Lancet 1986; 327(8476):307–310. DOI: 10.1016/S0140-6736(86)90837-8. PMID: 2868172. Available here.
2. Giavarina, D. “Understanding Bland Altman analysis”. Biochemia Medica 2015; 25(2):141–151. DOI: 10.11613/BM.2015.015. PMID: 26110027. Available here.
3. Euser, A, Dekker, F, Le Cessie, S. “A practical approach to Bland-Altman plots and variation coefficients for log transformed variables”. Journal of Clinical Epidemiology 2008; 61(10):978–982. DOI: 10.1016/j.jclinepi.2007.11.003. PMID: 18468854. Available here.