
Bland-Altman analysis is used to assess the agreement between two methods of measuring something, usually clinical information. It was introduced in Bland & Altman’s 1986 paper;1 see also the Wikipedia page.

Calculating agreement is useful when deciding whether two methods or devices that measure the same thing can be used interchangeably.

1 Where to Start: Bland & Altman’s Difference Plot

The example shown in Bland & Altman’s paper uses data created especially for the paper. Bland measured the maximum speed of expiration (peak expiratory flow rate, or ‘PEFR’) in litres per minute of 17 people - mainly his family and friends - using two different devices: a large and a mini Wright peak flow meter. This data is shown below:

import pandas as pd

dct = {
    'Wright Mini (L/min)': [
        512, 430, 520, 428, 500, 600, 364, 380, 658,
        445, 432, 626, 260, 477, 259, 350, 451
    ],
    'Wright Large (L/min)': [
        494, 395, 516, 434, 476, 557, 413, 442, 650,
        433, 417, 656, 267, 478, 178, 423, 427
    ]
}
df = pd.DataFrame(dct)

2 Visualise the Data

Here are the measurements taken by the two different devices plotted against one another (this reproduces Figure 1 of Bland & Altman (1986)):

import matplotlib.pyplot as plt

# Formatting options for plots
A = 5  # Want figures to be A5
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
plt.rc('text', usetex=True)  # Use LaTeX
plt.rc('font', family='serif')  # Use a serif font
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')  # Load Greek letters

# Create plot
ax = plt.axes()
x = df['Wright Large (L/min)']
y = df['Wright Mini (L/min)']
ax.scatter(x, y, c='k', s=20, alpha=0.6, marker='o')
# Labels
ax.set_title('Peak Expiratory Flow Rate')
ax.set_xlabel('Large Meter (L/min)')
ax.set_ylabel('Mini Meter (L/min)')
# Get axis limits
left, right = plt.xlim()
# Set axis limits
ax.set_xlim(0, right)
ax.set_ylim(0, right)
# Reference line
ax.plot([0, right], [0, right], c='grey', ls='--', label='Line of Equality')
# Set aspect ratio
ax.set_aspect('equal')
# Legend
ax.legend(frameon=False)
# Show plot
plt.show()

3 Time for Regression Analysis?

As Bland and Altman say in their paper, at this point researchers will often calculate the Pearson correlation coefficient (Pearson’s r) between the two methods. Doing so will give \(r = 0.94\) with \(p \lt 0.001\):

import scipy.stats as st

r, p = st.pearsonr(x, y)

print(f'r = {r:.2f} (p = {p:.2e})')
## r = 0.94 (p = 1.40e-08)

However, they continue: “The null hypothesis here is that the measurements by the two methods are not linearly related. The probability is very small and we can safely conclude that PEFR measurements by the mini and large meters are related. However, this high correlation does not mean that the two methods agree”. To paraphrase:

  • The Pearson correlation coefficient measures how correlated two variables are, not the extent to which they agree. Perfect agreement implies that all the points lie along the line of equality, while a large value for \(r\) merely implies that they lie close to some straight line (which could be any straight line)
  • Correlation does not depend on the scale of the measurements. If one device was calibrated incorrectly and as such gave measurements in units of mL/min instead of L/min the correlation would remain the same, but the two devices would definitely not agree!
  • Correlation depends on the range over which you test. If we only look at the data below 500 L/min or only at the data above 500 L/min we get smaller values for \(r\) (0.88 and 0.90 respectively) than when we look at all the data together (\(r = 0.94\)); a quick check of this is sketched after this list. However, it would be absurd to argue that agreement between the two devices is worse below 500 L/min and worse above 500 L/min than it is for everybody.
  • The test for significance may show that the two methods are related, but it would be amazing if two methods designed to measure the same quantity were not related! The test of significance is irrelevant to the question of agreement.
  • The statistic is difficult to interpret: is an r-value of 0.992 much worse than one of 0.996? How much worse?
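
To make the third point concrete, here is a quick check of how the correlation shrinks when the range is restricted. (Splitting on the large meter’s readings is an assumption; the paper does not say exactly how the data was split.)

# Correlation over restricted ranges, using 500 L/min as the split point
below = x < 500
r_below, _ = st.pearsonr(x[below], y[below])
r_above, _ = st.pearsonr(x[~below], y[~below])

print(f'r below 500 L/min: {r_below:.2f}; r above 500 L/min: {r_above:.2f}')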

All the above leads us towards looking for a new method of measuring agreement…

4 Bland-Altman Analysis

In essence, if we are interested in knowing to what extent two measurement methods differ, we should calculate the average difference between the values they produce when measuring the same participant. If this is small it means that it is effectively irrelevant which method you use; both will yield a similar result. The two methods or devices can thus be used interchangeably, or one can be used instead of the other if it is preferable for whatever reason (e.g. if it’s less expensive). To check that the measurement differences are not related to the actual values being measured, it is useful to plot them against the means of the pairs of values (ideally you would plot them against the true values, but as these are unknown the means of the measured pairs are your best guesses).

As such, we start by calculating the means and differences of the pairs of measurements. This is the data that will be plotted:

means = (x + y) / 2
diffs = x - y

Then find the average difference and the standard deviation of the differences. These are good indicators of whether or not a method is biased and how consistent this bias is, respectively:

import numpy as np

# Average difference (aka the bias)
bias = np.mean(diffs)
# Sample standard deviation
s = np.std(diffs, ddof=1)  # Use ddof=1 to get the sample standard deviation

print(f'For the differences, μ = {bias:.2f} L/min and s = {s:.2f} L/min')
## For the differences, μ = -2.12 L/min and s = 38.77 L/min

If we assume that the differences are normally distributed,* then approximately 95% of them will lie within two standard deviations of the mean difference (use 1.96 standard deviations if you want to be more accurate). The endpoints of this 95% range are known as the “limits of agreement” (LOAs):

# Limits of agreement (LOAs)
upper_loa = bias + 1.96 * s
lower_loa = bias - 1.96 * s

print(f'The limits of agreement are {upper_loa:.2f} L/min and {lower_loa:.2f} L/min')
## The limits of agreement are 73.86 L/min and -78.10 L/min

*Assuming a normal distribution for the differences is reasonable here because, even if the raw measurements themselves are not normally distributed, by taking the difference between pairs of measurements we remove a lot of the variation (‘noise’) and probably end up with mostly measurement error.
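
If you want to sanity-check this assumption (a step that is not in the original paper), one option is a Shapiro-Wilk test on the differences:

# Shapiro-Wilk test for normality of the differences
stat, p_value = st.shapiro(diffs)

print(f'Shapiro-Wilk test: W = {stat:.3f}, p = {p_value:.3f}')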

4.1 Aside: Values for the Z-Scores

The z-score (the number of standard deviations away from the mean) that corresponds to the endpoints of an interval containing the central 95% of the standard normal distribution is approximately 1.96. This value is used in the calculation above and yields LOAs of 73.86 L/min and -78.10 L/min. Bland and Altman used a rounded value of 2 in their paper (and also used a rounded value of 38.8 for their sample standard deviation) and thus got the slightly less accurate answers of 75.5 L/min and -79.7 L/min. For maximum precision, we could calculate the z-score for a confidence level of 95% directly:

# Confidence level
C = 0.95  # 95%
# Significance level, α
alpha = 1 - C
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Critical z-score, calculated using the percent-point function (aka the
# quantile function) of the normal distribution
z_star = st.norm.ppf(q)

print(f'95% of normally distributed data lies within {z_star}σ of the mean')
## 95% of normally distributed data lies within 1.959963984540054σ of the mean

The above yields LOAs of:

# Limits of agreement (LOAs)
loas = (bias - z_star * s, bias + z_star * s)

print(f'The limits of agreement are {loas} L/min')
## The limits of agreement are (-78.09590546711173, 73.86061134946466) L/min

A more convenient function that gets to this answer more quickly is norm.interval(), as shown below:

# Limits of agreement (LOAs)
loas = st.norm.interval(C, bias, s)

print(np.round(loas, 2))
## [-78.1   73.86]

4.2 Bland-Altman Plot

Now we can plot this data, re-producing Figure 2 of Bland & Altman (1986):

# Create plot
ax = plt.axes()
ax.scatter(means, diffs, c='k', s=20, alpha=0.6, marker='o')
# Plot the zero line
ax.axhline(y=0, c='k', lw=0.5)
# Plot the bias and the limits of agreement
ax.axhline(y=loas[1], c='grey', ls='--')
ax.axhline(y=bias, c='grey', ls='--')
ax.axhline(y=loas[0], c='grey', ls='--')
# Labels
ax.set_title('Bland-Altman Plot for Two Methods of Measuring PEFR')
ax.set_xlabel('Mean (L/min)')
ax.set_ylabel('Difference (L/min)')
# Get axis limits
left, right = ax.get_xlim()
bottom, top = ax.get_ylim()
# Set y-axis limits
max_y = max(abs(bottom), abs(top))
ax.set_ylim(-max_y * 1.1, max_y * 1.1)
# Set x-axis limits
domain = right - left
ax.set_xlim(left, left + domain * 1.1)
# Annotations
ax.annotate('+LOA', (right, upper_loa), (0, 7), textcoords='offset pixels')
ax.annotate(f'{upper_loa:+4.2f}', (right, upper_loa), (0, -25), textcoords='offset pixels')
ax.annotate('Bias', (right, bias), (0, 7), textcoords='offset pixels')
ax.annotate(f'{bias:+4.2f}', (right, bias), (0, -25), textcoords='offset pixels')
ax.annotate('-LOA', (right, lower_loa), (0, 7), textcoords='offset pixels')
ax.annotate(f'{lower_loa:+4.2f}', (right, lower_loa), (0, -25), textcoords='offset pixels')
# Show plot
plt.show()

5 Confidence Intervals

Neither the bias line nor the limits of agreement are known with certainty; they each have a ‘standard error’ associated with them. The standard error, \(se\), of the bias is:

\[{se}_{bias} = \sqrt{\frac{s^2}{n}}\]

where \(s\) is the sample standard deviation and \(n\) is the sample size. The standard error of the limits of agreement is approximately:

\[{se}_{LOA} = \sqrt{\frac{3s^2}{n}}\]

These standard errors can be converted into confidence intervals and, specifically, we will use 95% confidence intervals. These are ranges of numbers in which we can say with 95% confidence that the true values of the bias and of the LOAs lie. If we use the Student’s t-distribution* to do this then the formula for the endpoints of the confidence interval will be:

\[y \pm \left(t^* \times {se} \right)\]

where \(y\) is the horizontal line in question (the bias, upper LOA or lower LOA) and \(t^*\) is the critical t-statistic for this confidence level and this sample size. The derivation of this formula is on this page.

*The t-distribution is appropriate when working with a relatively small sample size (n < 30) and/or when the population standard deviation isn’t known, both of which are true in this case.

In Python, these calculations look as follows:

# Sample size
n = len(df)
# Degrees of freedom
dof = n - 1
# Standard error of the bias
se_bias = s / np.sqrt(n)
# Standard error of the LOAs
se_loas = np.sqrt(3 * s**2 / n)

# Confidence interval for the bias
ci_bias = st.t.interval(C, dof, bias, se_bias)
# Confidence interval for the lower LOA
ci_lower_loa = st.t.interval(C, dof, loas[0], se_loas)
# Confidence interval for the upper LOA
ci_upper_loa = st.t.interval(C, dof, loas[1], se_loas)

print(
    f' Lower LOA = {np.round(lower_loa, 2)}, 95% CI {np.round(ci_lower_loa, 2)}\n',
    f'Bias = {np.round(bias, 2)}, 95% CI {np.round(ci_bias, 2)}\n',
    f'Upper LOA = {np.round(upper_loa, 2)}, 95% CI {np.round(ci_upper_loa, 2)}'
)
##  Lower LOA = -78.1, 95% CI [-112.62  -43.57]
##  Bias = -2.12, 95% CI [-22.05  17.81]
##  Upper LOA = 73.86, 95% CI [ 39.34 108.38]

To recap:

  • We are not assuming that the raw measurements are normally distributed (although they might be)
  • We are assuming that the differences between pairs of raw measurements are normally distributed (because subtracting them from each other removes a lot of the variation between subjects while leaving measurement error)
    • We are therefore using the normal distribution to calculate the limits of agreement (via the use of 1.96 standard deviations)
  • The sample size is small (\(n = 17\))
    • We are therefore using the t-distribution to calculate the confidence intervals. If the sample size were large (\(n \geq 30\)) we could use the normal distribution to calculate the confidence intervals as well; see the page on confidence intervals for more.

5.1 Add Confidence Intervals to the Plot

# Add the confidence intervals to the Bland-Altman plot created above
ax.plot([left] * 2, list(ci_upper_loa), c='grey', ls='--', alpha=0.5)
ax.plot([left] * 2, list(ci_bias), c='grey', ls='--', alpha=0.5)
ax.plot([left] * 2, list(ci_lower_loa), c='grey', ls='--', alpha=0.5)
# Confidence intervals' caps
x_range = [left - domain * 0.025, left + domain * 0.025]
ax.plot(x_range, [ci_upper_loa[1]] * 2, c='grey', ls='--', alpha=0.5)
ax.plot(x_range, [ci_upper_loa[0]] * 2, c='grey', ls='--', alpha=0.5)
ax.plot(x_range, [ci_bias[1]] * 2, c='grey', ls='--', alpha=0.5)
ax.plot(x_range, [ci_bias[0]] * 2, c='grey', ls='--', alpha=0.5)
ax.plot(x_range, [ci_lower_loa[1]] * 2, c='grey', ls='--', alpha=0.5)
ax.plot(x_range, [ci_lower_loa[0]] * 2, c='grey', ls='--', alpha=0.5)
# Show plot
plt.show()

5.2 Aside: the t-Statistic

In Python, the t-statistic can be calculated for a given confidence level and a given sample size by using SciPy’s percent-point function (PPF) for the t-distribution:

import scipy.stats as st

# Sample size
n = len(df)
# Degrees of freedom
dof = n - 1
# Confidence level
C = 0.95  # 95%
# Significance level, α
alpha = 1 - C
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
t_star = st.t.ppf(q, dof)

print(f't* = {t_star:.2f}')
## t* = 2.12

This value of 2.12 is used in the Bland & Altman paper.

Be careful not to assign the t-statistic to the variable t if you have imported SciPy’s t-distribution as t (via from scipy.stats import t), as doing so will overwrite it. We have avoided this in the above snippet by using import scipy.stats as st.
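
A minimal illustration of that pitfall:

from scipy.stats import t

print(t.ppf(0.975, 16))  # Works: t is the t-distribution
t = 2.12                 # The name t now refers to a float...
# ...so calling t.ppf(0.975, 16) would now raise an AttributeError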

Usually, it’s simpler to calculate the t-statistic as an interval:

# Sample size
n = len(df)
# Degrees of freedom
dof = n - 1
# Confidence level
C = 0.95  # 95%
# Endpoints of the range that contains 95% of Student's t-distribution
ci = st.t.interval(C, dof)

print(np.round(ci, 2))
## [-2.12  2.12]

The t-statistic that corresponds to a 95% confidence interval given \(n - 1\) degrees of freedom can then be used to get the confidence intervals on the bias and LOA lines, as has been done above.

5.3 Aside: the t-Statistic vs the Z-Score

  • The t-statistic is the distance of the mean of a sample of values from the expected mean, as a multiple of standard errors
  • The z-score is the distance of a single value from the expected value, as a multiple of standard deviations

Using a t-statistic with the t-distribution is analogous to using a z-score with the normal distribution:

  • For a given t-statistic and sample size (more specifically, for a given t-statistic and number of degrees of freedom) a certain percentage of the t-distribution will lie within that distance from the mean
  • For a given z-score, a certain percentage of the normal distribution will lie within that distance from the mean
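
To illustrate the analogy (this comparison is not in the original): for a 95% confidence level, the critical t-statistic approaches the critical z-score as the sample size grows:

# Critical t-statistics for 95% confidence approach the critical z-score
for size in [5, 17, 30, 100, 1000]:
    t_crit = st.t.ppf(0.975, size - 1)
    print(f'n = {size:4d}: t* = {t_crit:.3f}')
print(f'z* = {st.norm.ppf(0.975):.3f}')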

6 How Does the Precision of the Test Change with Sample Size?

The widths of the 95% confidence intervals depend on two factors: the sample standard deviation and the sample size. Obviously, when designing an experiment, you can’t control what the standard deviation of your data will be, but you CAN control the number of samples you test. Hence, it’s useful to know how this number will affect the confidence you will have in your result.

As mentioned above, the confidence intervals are calculated as:

\[y \pm \left(t^* \times {se} \right)\]

where \(y\) is the horizontal line you are interested in - the bias, upper LOA or lower LOA - and \(se\) is its standard error. \(t^*\) is the critical t-statistic that corresponds to 95% confidence for the sample size in question. This equation implies that the half-widths of the confidence intervals are:

\[t^* \times {se}\]

which means that the full-widths of the confidence intervals for the bias and LOA lines, respectively, are twice these:

\[CI_{bias} = 2 \times t^* \times \sqrt{\dfrac{ {s}^2}{n}}\]

\[CI_{LOAs} = 2 \times t^* \times \sqrt{\dfrac{3 {s}^2}{n}}\]

so the widths of the confidence intervals relative to the sample standard deviations are:

\[\dfrac{CI_{bias}}{s} = 2 \times t^* \times \sqrt{\dfrac{1}{n}}\]

\[\dfrac{CI_{LOAs}}{s} = 2 \times t^* \times \sqrt{\dfrac{3}{n}}\]

In Python, these are calculated as:

# Confidence level
C = 0.95  # 95%
# Significance level, α
alpha = 1 - C
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
t_star = st.t.ppf(q, dof)

ci_width_bias = 2 * t_star * np.sqrt(1 / n)
ci_width_loas = 2 * t_star * np.sqrt(3 / n)
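
For this example’s sample size of \(n = 17\) these evaluate as follows:

print(f'CI width (bias): {ci_width_bias:.2f}s; CI width (LOAs): {ci_width_loas:.2f}s')
## CI width (bias): 1.03s; CI width (LOAs): 1.78s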

and they can be plotted as follows:

# Create plot
ax = plt.axes()
# Scatter plots
n = np.arange(5, 51)
dof = n - 1
t_star = st.t.ppf(q, dof)
ci_width_loas = 2 * t_star * np.sqrt(3 / n)
ax.scatter(n, ci_width_loas, c='k', s=20, alpha=0.6, marker='o')
ci_width_bias = 2 * t_star * np.sqrt(1 / n)
ax.scatter(n, ci_width_bias, c='k', s=20, alpha=0.6, marker='o')
# Smooth curves
n = np.arange(5, 50, 0.1)
dof = n - 1
t_star = st.t.ppf(q, dof)
ci_width_loas = 2 * t_star * np.sqrt(3 / n)
label = r'LOAs $\left(2 \times t^* \times \sqrt{\frac{3}{n}}\right)$'
ax.plot(n, ci_width_loas, label=label)
ci_width_bias = 2 * t_star * np.sqrt(1 / n)
label = r'Bias $\left(2 \times t^* \times \sqrt{\frac{1}{n}}\right)$'
ax.plot(n, ci_width_bias, c='tab:blue', alpha=0.2, label=label)
# Labels
ax.set_title(
    'The widths of the confidence intervals (relative to the\n'
    'sample standard deviation) decrease if there are more samples'
)
ax.set_xlabel('Sample Size (n)')
ax.set_ylabel('Width of Confidence Intervals\n(Relative to Sample Standard Deviation)')
# Set y-axis limits
bottom, top = ax.get_ylim()
ax.set_ylim(0, top)
# Set x-axis limits
ax.set_xlim(5, 50)
# Legend
ax.legend(frameon=False)
# Show plot
plt.show()

As an example, if you decrease your sample size from 40 to 30, the widths of the confidence intervals of your limits of agreement will increase by 16.8%:

# Sample size
n = 40
# Degrees of freedom
dof = n - 1
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
t_star = st.t.ppf(q, dof)
# Confidence interval width
width_40 = 2 * t_star * np.sqrt(3 / n)

# Sample size
n = 30
# Degrees of freedom
dof = n - 1
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
t_star = st.t.ppf(q, dof)
width_30 = 2 * t_star * np.sqrt(3 / n)

increase = (width_30 - width_40) / width_40 * 100
print(
    f"Your confidence intervals' widths have gone from {width_40:4.2f} standard deviations to",
    f"{width_30:4.2f} standard deviations, an increase of {increase:4.1f}%"
)
## Your confidence intervals' widths have gone from 1.11 standard deviations to 1.29 standard deviations, an increase of 16.8%
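
Going the other way (a sketch, not a calculation from the paper): if you want the LOAs’ confidence intervals to be narrower than some target width, you can search for the smallest sample size that achieves it:

# Find the smallest n for which the LOAs' CI width is below 1 standard deviation
target = 1
n = 5
while 2 * st.t.ppf(q, n - 1) * np.sqrt(3 / n) > target:
    n += 1

print(f'Smallest sufficient sample size: n = {n}')
## Smallest sufficient sample size: n = 49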

7 A Bland-Altman Function

When doing multiple Bland-Altman calculations it’s often useful to have the analysis wrapped up in a function: multiple data sets can then be analysed in quick succession. Here’s an example of such a function:

def bland_altman_analysis(df):
    """
    Calculate agreement statistics.

    Within-subject SD and repeatability coefficient are defined in Shukla-Dave
    (2019), available here:
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6526078/
    """
    # Get the names of the two methods being compared (the column names)
    tests = list(df)

    # Individual subject calculations
    df['Means'] = df[tests].mean(axis=1)
    # Differences (second column minus the first)
    df['Differences'] = df[tests].diff(axis=1)[tests[-1]]
    df['s'] = df[tests].std(axis=1, ddof=1)
    df['var'] = df['s']**2

    # Whole sample calculations
    summary = pd.DataFrame()
    means = ['Mean of ' + test for test in tests]
    for i, mean in enumerate(means):
        summary.loc[1, mean] = df[tests[i]].mean()
    # Sample size
    summary.loc[1, 'Sample Size (n)'] = df.shape[0]
    # Degrees of freedom
    summary.loc[1, 'DOF'] = df.shape[0] - 1
    # Bias (mean difference)
    mean_diff = df['Differences'].mean()
    summary.loc[1, 'Bias'] = mean_diff
    # Sample standard deviations of the differences
    s_diff = df['Differences'].std(ddof=1)
    summary.loc[1, 'Sample SD (s)'] = s_diff
    summary.loc[1, 'Lower LOA'] = mean_diff - 1.96 * s_diff
    summary.loc[1, 'Upper LOA'] = mean_diff + 1.96 * s_diff
    # Within-subject standard deviation
    s_w = np.sqrt(df['var'].mean())
    summary.loc[1, 'Within-Subject SD (Sw)'] = s_w
    # Coefficient of repeatability
    col = 'Repeatability Coefficient (RC)'
    summary.loc[1, col] = np.sqrt(2) * 1.96 * s_w

    # Return
    return df, summary
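
For reference, the within-subject standard deviation, \(s_w\), and the repeatability coefficient, \(RC\), calculated at the end of this function are defined in Shukla-Dave (2019) as:

\[s_w = \sqrt{\overline{s_i^2}}\]

where \(s_i^2\) is the variance of subject \(i\)’s pair of measurements, and:

\[RC = 1.96 \times \sqrt{2} \times s_w\]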

Here is how it can be used:

7.1 O’Brien and Kaiser’s Repeated-Measures Data

From the R Documentation: “These contrived repeated-measures data are taken from O’Brien and Kaiser (1985). The data are from an imaginary study in which 16 female and male subjects, who are divided into three treatments, are measured at a pretest, posttest, and a follow-up session; during each session, they are measured at five occasions at intervals of one hour. The design, therefore, has two between-subject and two within-subject factors.”

from pydataset import data

OBrienKaiser = data('OBrienKaiser')
df = OBrienKaiser[['pre.3', 'pre.4']]
df, summary = bland_altman_analysis(df)
print(summary)
##    Mean of pre.3  Mean of pre.4  Sample Size (n)   DOF    Bias  Sample SD (s)  \
## 1         5.3125            4.5             16.0  15.0 -0.8125       1.833712   
## 
##    Lower LOA  Upper LOA  Within-Subject SD (Sw)  \
## 1  -4.406576   2.781576                 1.38067   
## 
##    Repeatability Coefficient (RC)  
## 1                        3.827022

7.2 Statsmodels

This example comes from here:

np.random.seed(9999)
m1 = np.random.random(20)
m2 = np.random.random(20)
df = pd.DataFrame({
    'pre.1': m2,
    'pre.2': m1
})
df, summary = bland_altman_analysis(df)
print(summary)
##    Mean of pre.1  Mean of pre.2  Sample Size (n)   DOF      Bias  \
## 1       0.563223       0.427121             20.0  19.0 -0.136102   
## 
##    Sample SD (s)  Lower LOA  Upper LOA  Within-Subject SD (Sw)  \
## 1       0.352535  -0.827071   0.554866                0.261334   
## 
##    Repeatability Coefficient (RC)  
## 1                         0.72438

7.3 Bland-Altman (1986)

These examples use the repeatability data from the same Bland & Altman (1986) paper1, in which each meter was used twice on each subject:

# Raw data
dct = {
    'First Measurement': [
        494, 395, 516, 434, 476, 557, 413, 442, 650, 433, 417, 656, 267, 478, 178, 423, 427
    ],
    'Second Measurement': [
        490, 397, 512, 401, 470, 611, 415, 431, 638, 429, 420, 633, 275, 492, 165, 372, 421
    ],
}
wright_large = pd.DataFrame(dct)
# Bland-Altman analysis
df, summary = bland_altman_analysis(wright_large)
print(summary)
##    Mean of First Measurement  Mean of Second Measurement  Sample Size (n)  \
## 1                 450.352941                  445.411765             17.0   
## 
##     DOF      Bias  Sample SD (s)  Lower LOA  Upper LOA  \
## 1  16.0 -4.941176      21.724038 -47.520291  37.637938   
## 
##    Within-Subject SD (Sw)  Repeatability Coefficient (RC)  
## 1               15.306669                       42.427922

# Raw data
dct = {
    'First Measurement': [
        512, 430, 520, 428, 500, 600, 364, 380, 658, 445, 432, 626, 260, 477, 259, 350, 451
    ],
    'Second Measurement': [
        525, 415, 508, 444, 500, 625, 460, 390, 642, 432, 420, 605, 227, 467, 268, 370, 443
    ],
}
wright_mini = pd.DataFrame(dct)
# Bland-Altman analysis
df, summary = bland_altman_analysis(wright_mini)
print(summary)
##    Mean of First Measurement  Mean of Second Measurement  Sample Size (n)  \
## 1                 452.470588                  455.352941             17.0   
## 
##     DOF      Bias  Sample SD (s)  Lower LOA  Upper LOA  \
## 1  16.0  2.882353       28.87231 -53.707375  59.472081   
## 
##    Within-Subject SD (Sw)  Repeatability Coefficient (RC)  
## 1               19.910831                       55.190007

8 Giavarina and Euser Analyses?

Other authors have expanded upon Bland and Altman’s ideas:

Giavarina analysis is identical to Bland-Altman analysis except that it accounts for heteroscedasticity. It does this by using percentage differences (relative to the means) on the y-axis instead of arithmetic differences. It was published in Giavarina’s 2015 paper2 and a tutorial that replicates elements of that paper is over here.
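
As a minimal sketch of the idea (using the PEFR data from above; this is not a full Giavarina analysis):

# Percentage differences relative to the means, as used in Giavarina analysis
diffs_pct = (x - y) / means * 100
bias_pct = np.mean(diffs_pct)
s_pct = np.std(diffs_pct, ddof=1)

print(f'Bias: {bias_pct:.1f}%; LOAs: {bias_pct - 1.96 * s_pct:.1f}%, {bias_pct + 1.96 * s_pct:.1f}%')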

Euser analysis also accounts for heteroscedasticity except that it does so via a logarithmic transformation. While this approach was mentioned in Bland & Altman (1986) it was Euser, Dekker & Le Cessie who published a method to transform the data back into the native space and calculate a meaningful coefficient of variation.3 A tutorial is over here.
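
Similarly, a minimal sketch of the log-transformation approach (the back-transformed coefficient of variation itself is covered in the linked tutorial):

# Differences of the log-transformed measurements
diffs_log = np.log(x) - np.log(y)
bias_log = np.mean(diffs_log)
s_log = np.std(diffs_log, ddof=1)
# Taking antilogs gives limits for the ratio between the two methods
loas_ratio = np.exp([bias_log - 1.96 * s_log, bias_log + 1.96 * s_log])

print(f'Ratio limits of agreement: {loas_ratio[0]:.2f} to {loas_ratio[1]:.2f}')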



  1. Bland, M, Altman, D. “Statistical methods for assessing agreement between two methods of clinical measurement”. Lancet 1986; 327(8476):307–310. DOI: 10.1016/S0140-6736(86)90837-8. PMID: 2868172. Available here. ↩︎

  2. Giavarina, D. “Understanding Bland Altman analysis”. Biochemia Medica 2015; 25(2):141–151. DOI: 10.11613/BM.2015.015. PMID: 26110027. Available here. ↩︎

  3. Euser, A, Dekker, F, le Cessie, S. “A practical approach to Bland-Altman plots and variation coefficients for log transformed variables”. Journal of Clinical Epidemiology 2008; 61(10):978–982. DOI: 10.1016/j.jclinepi.2007.11.003. PMID: 18468854. Available here. ↩︎