In theory, if you measure the same thing twice you should get the same reading both times. In reality, however, all measurements will show some variability if repeated. The term repeatability refers to:
The closeness of agreement between measurements when measured under the same conditions1, 2, 3
The repeatability coefficient (RC) is a number that, if you make two measurements of the same thing under the same conditions, the difference between those two measurements will be less than the RC in 95% of cases.4, 5 The smaller the repeatability coefficient is, the better.
When we say that the repeated measurements need to be done under “the same conditions”, what we mean is that they need to be done with the same:
and be done using the same units and over a short period of time.6
As mentioned, the repeatability coefficient is the value under which the difference between any two repeat measurements of the same measurand acquired under repeatability conditions (aka identical conditions) should fall with 95% probability. If we assume normality (ie if multiple measurements of the same thing will produce values that are normally distributed around its true value) then this can be mathematically represented as:
\(RC = 1.96 \times \sqrt{2\sigma_w^2} = 2.77\sigma_w\)
where \(\sigma_w\) is the within-subject standard deviation - the standard deviation of the values you will get if you measure the same subject multiple times. In practice, it is not always feasible to attempt to calculate the within-subject standard deviation exactly (as it might require too many repeats, which can be expensive) and it isn’t always possible either (ie it might be a function instead of a single number) so it is usually approximated as the average (mean) sample standard deviation of multiple sets of repeated measurements. When approximating in this way, we use the symbol \(s_w\) to differentiate from the true value of the within-subject standard deviation, \(\sigma_w\).
In the special case where the measurements are performed exactly twice on each measurand the above equation can be re-written as follows7, 8:
\(RC = 1.96 \times \sqrt{\frac{\Sigma\left(m_2 - m_1\right)^2}{n}}\)
where \(m_1\) and \(m_2\) are the two measurements performed on each of the \(n\) measurands. Although this equation looks more complicated than the first it is actually less computationally intense. The reason this simplification works is because it just so happens that the variance of two observations is equal to half the square of their difference.9
If a random number is chosen from a normal distribution that has a mean of 0 and a standard deviation of 1, the probability that the number will be less than or equal to 1.96 is 0.95 (ie a 95% chance). Therefore, this number appears a lot in formulas that use a 95% confidence level, such as this one. It can be calculated more exactly in Python by using the percent-point function (aka the quantile function) of the normal distribution as found in Scipy’s “stats” module:
from scipy import stats
# Significance level
alpha = 0.05
# Two-tailed test
tails = 2
print(stats.norm.ppf(1 - alpha / tails))
## 1.959963984540054
Note that the stats.norm.ppf
function uses a normal distribution with a mean of 0 and a standard deviation of 1 by default but, for the record, this can be changed using the loc
and scale
keyword arguments.
Reproducibility is the closeness of agreement between measurements when measured under different conditions.
Like repeatability, it involves the same measurement being performed multiple times except now with different locations, operators or measuring systems (aka under reproducibility conditions).
Similar to the repeatability coefficient, a reproducibility coefficient (RDC) can be used. The equation for this is \(RDC = 1.96 \times \sqrt{2\sigma_w^2 + \nu^2}\) which is exactly the same as RC with the addition of the \(\nu^2\) term (which is the variability attributed to the differing conditions). The value of \(\nu^2\) depends on what variables change between measurements.
The RC requires the standard deviation of your data to be known. This could be the true standard deviation or it could be an estimated value:
Here’s an example that ostensibly uses the true value of the within-subject standard deviation. In reality, however, it’s impossible to known the true value unless you have an infinite number of measurements, so this is still just an estimate:
import numpy as np
# Fake data: 10 hypothetical measurements of some measurand
measurements = [13.63, 6.82, 10.76, 11.03, 11.49, 11.31, 7.33, 13.30, 9.77, 8.79]
# Within-subject sample standard deviation
sigma_w = np.std(measurements, ddof=1)
# Repeatability coefficient
rc = 1.96 * np.sqrt(2) * sigma_w
print(rc)
## 6.308007598388999
The 10 numbers used in the above example were generated by selecting from a normal distribution with mean 10 and standard deviation 2. The true value of \(\sigma_w\) for this example was thus 2 and the ‘correct’ value of RC was therefore \(2.77\times 2 = 5.54\).
Notice that in the above example we used ddof=1
when calculating the standard deviation. This is the ‘delta degrees of freedom’ where N - ddof
forms the denominator in the standard deviation calculation. In other words, when ddof=0
the function will calculate the population standard deviation whereas when ddof=1
the function will calculate the sample standard deviation.
In this example we only have two repeats of the measurements as opposed to the 10 we had in the first example. To make up for it, however, we have performed these repeats on 7 different measurands:
import numpy as np
# Fake data: two hypothetical measurements of seven different measurands
repeat_1 = [2.73, 2.01]
repeat_2 = [1.93, 9.10]
repeat_3 = [5.47, 7.36]
repeat_4 = [11.71, 8.26]
repeat_5 = [10.44, 9.05]
repeat_6 = [10.66, 9.97]
repeat_7 = [12.66, 10.01]
# Within-subject sample variances
var_1 = np.var(repeat_1, ddof=1)
var_2 = np.var(repeat_2, ddof=1)
var_3 = np.var(repeat_3, ddof=1)
var_4 = np.var(repeat_4, ddof=1)
var_5 = np.var(repeat_5, ddof=1)
var_6 = np.var(repeat_6, ddof=1)
var_7 = np.var(repeat_7, ddof=1)
# Mean within-subject sample variance
var_w = np.mean([var_1, var_2, var_3, var_4, var_5, var_6, var_7])
# Within-subject sample standard deviation
s_w = np.sqrt(var_w)
# Coefficient of repeatability
rc = 1.96 * np.sqrt(2) * s_w
print(rc)
## 6.493514524508281
The numbers used in this example were again taken from normal distributions, this time with means of 5 through to 11 and all with standard deviations of 2. So the ‘correct’ answer for RC was again \(2.77\times 2 = 5.54\).
Note that on this occasion we couldn’t calculate \(s_w\) immediately because you can’t meaningfully take the average of a list of standard deviations. We first had to calculate the sample variances and take the average of those, before square rooting it to get \(s_w\).
Most journal papers will cite one of Bland & Altman’s papers when they are performing repeatability analyses (indeed, their 1986 paper is the 29th most-cited paper of all time10). Because these papers are so important in this area we will use them for the examples that follow.
This first example uses lung function data from 17 participants. Each person had a first and a second measurement of their peak expiratory flow rate taken. In Python, the ‘Pandas’ package makes it easier to perform calculations on tabular data, so that is being used in the snippet below:
import numpy as np
import pandas as pd
# Raw data
wright_large = pd.DataFrame({
'First Measurement': [494, 395, 516, 434, 476, 557, 413, 442, 650, 433, 417, 656, 267, 478, 178, 423, 427],
'Second Measurement': [490, 397, 512, 401, 470, 611, 415, 431, 638, 429, 420, 633, 275, 492, 165, 372, 421],
})
# Within-subject sample variances
var_w = wright_large[['First Measurement', 'Second Measurement']].var(axis=1, ddof=1)
# Mean within-subject sample variance
var_w = var_w.mean()
# Within-subject sample standard deviation
s_w = np.sqrt(var_w)
# Coefficient of repeatability
rc = 1.96 * np.sqrt(2) * s_w
print(rc)
## 42.42792199372817
The second example from the same paper:
import numpy as np
import pandas as pd
# Raw data
wright_mini = pd.DataFrame({
'First Measurement': [512, 430, 520, 428, 500, 600, 364, 380, 658, 445, 432, 626, 260, 477, 259, 350, 451],
'Second Measurement': [525, 415, 508, 444, 500, 625, 460, 390, 642, 432, 420, 605, 227, 467, 268, 370, 443],
})
# Within-subject sample variances
var_w = wright_mini[['First Measurement', 'Second Measurement']].var(axis=1, ddof=1)
# Mean within-subject sample variance
var_w = var_w.mean()
# Within-subject sample standard deviation
s_w = np.sqrt(var_w)
# Coefficient of repeatability
rc = 1.96 * np.sqrt(2) * s_w
print(rc)
## 55.19000676806285
Note that Bland & Altman actually rounded off a bit when they did these calculations and also used a distance of 2 standard deviations from the mean instead of the more accurate 1.96. As a result, their answers are 43.2 l/min for the large meter and 56.4 l/min for the mini meter whereas ours are 42.4 and 55.2. Additionally, they used the simplified version of the calculation that only works when exactly two repeats have been performed, as mentioned above. For completeness, here is their method:
# Measurement differences
diffs = wright_large['First Measurement'] - wright_large['Second Measurement']
# Squared differences
sq_diffs = diffs**2
# Mean squared difference (aka within-subject variance)
var_w = sq_diffs.mean()
# Within-subject sample standard deviation
s_w = np.sqrt(var_w)
# Coefficient of repeatability
rc = 1.96 * s_w
print(rc)
## 42.42792199372816
# Measurement differences
diffs = wright_mini['First Measurement'] - wright_mini['Second Measurement']
# Squared differences
sq_diffs = diffs**2
# Total squared differences
print(sq_diffs.sum())
## 13479
# Mean squared difference (aka within-subject variance)
var_w = sq_diffs.mean()
# Within-subject sample standard deviation
s_w = np.sqrt(var_w)
# Coefficient of repeatability
rc = 1.96 * s_w
print(rc)
## 55.19000676806285
The above is also calculated on this page except that they use 2 standard deviations instead of 1.96 and hence their answer is 56.3163 as opposed to 55.1900.
In their 1996 paper, Bland and Altman demonstrate how to calculate the repeatability coefficient when multiple repeats of the same measurement have been performed on each measurand.
Instead of copy-pasting the same code we used above to replicate this example, let’s create a function to do it:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 20)
def repeatability_coefficient(df):
"""Calculate repeatability coefficient."""
# Within-subject sample variances
var_w = df.var(axis=1, ddof=1)
# Mean within-subject sample variance
var_w = var_w.mean()
# Within-subject sample standard deviation
s_w = np.sqrt(var_w)
# Coefficient of repeatability
rc = 1.96 * np.sqrt(2) * s_w
return rc
def summary_table(df):
"""Create a summary table of agreement statistics."""
summary = {}
# Mean
for colname, col in df.items():
summary['Mean of ' + colname] = col.mean()
# Sample size
summary['N'] = df.shape[0]
# Degrees of freedom
summary['DoF'] = df.shape[0] - 1
# Sample standard deviations
s = df.std(axis=1, ddof=1)
# Sample variances
var = s**2
# Mean variance
summary['Mean Variance'] = var.mean()
# Within-subject standard deviation
s_w = np.sqrt(var.mean())
summary['Within-Subject SD (Sw)'] = s_w
# Coefficient of repeatability
rc = repeatability_coefficient(df)
summary['Repeatability Coefficient (RC)'] = rc
return summary
# Raw data
df = pd.DataFrame({
'Read 1': [190, 220, 260, 210, 270, 280, 260, 275, 280, 320, 300, 270, 320, 335, 350, 360, 330, 335, 400, 430],
'Read 2': [220, 200, 260, 300, 265, 280, 280, 275, 290, 290, 300, 250, 330, 320, 320, 320, 340, 385, 420, 460],
'Read 3': [200, 240, 240, 280, 280, 270, 280, 275, 300, 300, 310, 330, 330, 335, 340, 350, 380, 360, 425, 480],
'Read 4': [200, 230, 280, 265, 270, 275, 300, 305, 290, 290, 300, 370, 330, 375, 365, 345, 390, 370, 420, 470],
})
# Calculate agreement statistics
summary = summary_table(df)
for key, value in summary.items():
print(key, '=', round(value, 2))
## Mean of Read 1 = 299.75
## Mean of Read 2 = 305.25
## Mean of Read 3 = 315.25
## Mean of Read 4 = 322.0
## N = 20
## DoF = 19
## Mean Variance = 460.52
## Within-Subject SD (Sw) = 21.46
## Repeatability Coefficient (RC) = 59.48
Now that we have a function, let’s use it on the data from Bland and Altman’s 1999 paper.11 These are systolic blood pressure measurements in mmHg taken by an operator J and a machine S:
# Raw data
j = pd.DataFrame({
'Measurement 1': [
100, 108, 76, 108, 124, 122, 116, 114, 100, 108, 100, 108, 112, 104, 106, 122, 100, 118, 140, 150, 166, 148,
174, 174, 140, 128, 146, 146, 220, 208, 94, 114, 126, 124, 110, 90, 106, 218, 130, 136, 100, 100, 124, 164, 100,
136, 114, 148, 160, 84, 156, 110, 100, 100, 86, 106, 108, 168, 166, 146, 204, 96, 134, 138, 134, 156, 124, 114,
112, 112, 202, 132, 158, 88, 170, 182, 112, 120, 110, 112, 154, 116, 108, 106, 122
],
'Measurement 2': [
106, 110, 84, 104, 112, 140, 108, 110, 108, 92, 106, 112, 112, 108, 108, 122, 102, 118, 134, 148, 154, 156, 172,
166, 144, 134, 138, 152, 218, 200, 84, 124, 120, 124, 120, 90, 106, 202, 128, 136, 96, 98, 116, 168, 102, 126,
108, 120, 150, 92, 162, 98, 106, 102, 74, 100, 110, 188, 150, 142, 198, 94, 126, 144, 136, 160, 138, 110, 116,
116, 220, 136, 162, 76, 174, 176, 114, 118, 108, 112, 134, 112, 110, 98, 112
],
'Measurement 3': [
107, 108, 82, 104, 112, 124, 102, 112, 112, 100, 104, 122, 110, 104, 102, 114, 102, 120, 138, 144, 154, 134,
166, 150, 144, 130, 140, 148, 220, 192, 86, 116, 122, 132, 128, 94, 110, 208, 130, 130, 88, 88, 122, 154, 100,
122, 122, 132, 148, 98, 152, 98, 106, 94, 76, 110, 106, 178, 154, 132, 188, 86, 124, 140, 142, 154, 138, 114,
122, 134, 228, 134, 152, 88, 176, 180, 124, 120, 106, 106, 130, 94, 114, 100, 112
],
})
# Calculate agreement statistics
summary = summary_table(j)
for key, value in summary.items():
print(key, '=', round(value, 2))
## Mean of Measurement 1 = 128.54
## Mean of Measurement 2 = 127.29
## Mean of Measurement 3 = 126.39
## N = 85
## DoF = 84
## Mean Variance = 37.41
## Within-Subject SD (Sw) = 6.12
## Repeatability Coefficient (RC) = 16.95
# Raw data
s = pd.DataFrame({
'Measurement 1': [
122, 121, 95, 127, 140, 139, 122, 130, 119, 126, 107, 123, 131, 123, 127, 142, 104, 117, 139, 143, 181, 149,
173, 160, 158, 139, 153, 138, 228, 190, 103, 131, 131, 126, 121, 97, 116, 215, 141, 153, 113, 109, 145, 192,
112, 152, 141, 206, 151, 112, 162, 117, 119, 136, 112, 120, 117, 194, 167, 173, 228, 77, 154, 154, 145, 200,
188, 149, 136, 128, 204, 184, 163, 93, 178, 202, 162, 227, 133, 202, 158, 124, 114, 137, 121
],
'Measurement 2': [
128, 127, 94, 127, 131, 142, 112, 129, 122, 113, 113, 125, 129, 126, 119, 133, 116, 113, 127, 155, 170, 156,
170, 155, 152, 144, 150, 144, 228, 183, 99, 131, 123, 129, 114, 94, 121, 201, 133, 143, 107, 105, 102, 178, 116,
144, 141, 188, 147, 125, 165, 118, 131, 116, 115, 118, 118, 191, 160, 161, 218, 89, 156, 155, 154, 180, 147,
217, 132, 125, 222, 187, 160, 88, 181, 199, 166, 227, 127, 190, 121, 149, 118, 135, 123
],
'Measurement 3': [
124, 128, 98, 135, 124, 136, 112, 135, 122, 111, 111, 125, 122, 114, 126, 137, 115, 112, 113, 133, 166, 140,
154, 170, 154, 141, 154, 131, 226, 184, 106, 124, 124, 125, 125, 96, 127, 207, 146, 138, 102, 97, 137, 171, 116,
147, 137, 166, 136, 124, 189, 109, 124, 113, 104, 132, 115, 196, 161, 154, 189, 101, 141, 148, 166, 179, 139,
192, 133, 142, 224, 192, 152, 88, 181, 195, 148, 219, 126, 213, 134, 137, 126, 134, 128
],
})
# Calculate agreement statistics
summary = summary_table(s)
for key, value in summary.items():
print(key, '=', round(value, 2))
## Mean of Measurement 1 = 144.84
## Mean of Measurement 2 = 142.74
## Mean of Measurement 3 = 141.51
## N = 85
## DoF = 84
## Mean Variance = 83.14
## Within-Subject SD (Sw) = 9.12
## Repeatability Coefficient (RC) = 25.27
As the machine (S) has an RC of 25.27 mmHg while the operator (J) has an RC of 16.95 mmHg, we can say that the repeatability of the machine is 49% greater (ie worse) that operator J.
Whenever you write a function, it’s good practice to document it correctly using a docstring. Here is the function used above re-produced with a full docstring:
def repeatability_coefficient(df):
"""
Calculate repeatability coefficient.
See here:
https://rowannicholls.github.io/python/statistics/agreement/repeatability_coefficient.html
Parameters
----------
df : DataFrame
The observations or measurements made by various observers. One
observer's data per column, one observation/measurement per row. This
function requires df to only have two sets of observations from the
'expert' and any number of sets of observation by others.
Returns
-------
rc : float
The repeatability coefficient.
Notes
-----
The "1.96" in the formula "rc = 1.96 * np.sqrt(2) * s_w" can be calculated
more precisely from the normal distribution's percent-point function (PPF)
- for a confidence level of 95% and assuming a two-tailed test - as
follows:
>>> from scipy import stats
>>> alpha = 0.05
>>> tails = 2
>>> print(stats.norm.ppf(1 - alpha / tails))
1.959963984540054
Examples
--------
>>> dct = {
... 'x1': [1, 2, 3, 4, 5],
... 'x2': [2, 3, 3, 5, 5],
... 'y1': [2, 1, 2, 3, 4],
... 'y2': [1, 1, 3, 2, 4],
... 'y3': [2, 3, 2, 4, 5]
... }
>>> df = pd.DataFrame(dct)
>>> print(repeatability_coefficient(df))
2.217486865801013
This example is taken from Bland (1986), although the value of RC in that
paper is given as 43.2 (not 42.4) due to rounding errors.
>>> df = pd.DataFrame({
... 'First Measurement': [
... 494, 395, 516, 434, 476, 557, 413, 442, 650, 433, 417, 656,
... 267, 478, 178, 423, 427
... ],
... 'Second Measurement': [
... 490, 397, 512, 401, 470, 611, 415, 431, 638, 429, 420, 633,
... 275, 492, 165, 372, 421
... ]
... })
>>> rc = repeatability_coefficient(df)
>>> print(f'RC = {rc}')
RC = 42.42792199372817
"""
# Within-subject sample variances
var_w = df.var(axis=1, ddof=1)
# Mean within-subject sample variance
var_w = var_w.mean()
# Within-subject sample standard deviation
s_w = np.sqrt(var_w)
# Coefficient of repeatability
rc = 1.96 * np.sqrt(2) * s_w
return rc
BIPM, IEC, IFCC, ILAC, ISO, IUPAC, IUPAP, OIML. “Evaluation of measurement data –- Guide to the expression of uncertainty in measurement”. Joint Committee for Guides in Metrology, JCGM 100:2008. Available here. Jump to reference: ↩︎
Algorithm Comparison Working Group. “Quantitative imaging biomarkers: a review of statistical methods for computer algorithm comparisons”. Statistical Methods in Medical Research 2015; 24(1):68–106. DOI: 10.1177/0962280214537390. PMID: 24919829. Available here. Jump to reference: ↩︎
QIBA Terminology Working Group. “The emerging science of quantitative imaging biomarkers terminology and definitions for scientific studies and regulatory submissions”. Statistical Methods in Medical Research 2015; 24(1):9-26. DOI: 10.1177/0962280214537333. PMID: 24919826. Jump to reference: ↩︎
British Standards Institution. “Precision of test methods 1: Guide for the determination and reproducibility for a standard test method”. British Standards 1975; 597(Part 1). Jump to reference: ↩︎
Quantitative Imaging Biomarkers Alliance (QIBA). “Indices of Repeatability, Reproducibility, and Agreement”. 2013. Available here. Jump to reference: ↩︎
Barnhart, H, Haber, M, Lin, L. “An overview on assessing agreement with continuous measurements”. Journal of Biopharmaceutical Statistics 2007; 17(4):529-569. DOI: 10.1080/10543400701376480. Jump to reference: ↩︎
Bland, M, Altman, D. “Statistical methods for assessing agreement between two methods of clinical measurement”. Lancet 1986; 327(8476):307–310. DOI: 10.1016/S0140-6736(86)90837-8. PMID: 2868172. Available here. Jump to reference: ↩︎
MedCalc Software Ltd. Bland-Altman plot. 2020. Available here. Jump to reference: ↩︎
Bland, J, Altman, D. “Statistics notes: Measurement error”. The BMJ 1996; 312:1654. DOI: 10.1136/bmj.312.7047.1654. Jump to reference: ↩︎
Van Noorden, R, Maher, B, Nuzzo, R. “The top 100 papers”. Nature. 2014; 514:550-553. Available here. Jump to reference: ↩︎
Bland, J, Altman, D. “Measuring agreement in method comparison studies”. Statistical Methods in Medical Research 1999; 8(2):135-160. DOI: 10.1177/096228029900800204. PMID: 10501650. Jump to reference: ↩︎