This page complements a companion page on Bland-Altman analysis.

1 Python Packages

The code on this page uses the pandas, NumPy and SciPy packages. These can be installed from the terminal with:

# Replace "python3.12" with the version of Python you are using
python3.12 -m pip install pandas
python3.12 -m pip install numpy
python3.12 -m pip install scipy

Once finished, import these packages into your Python script as follows:

import pandas as pd
import numpy as np
from scipy import stats as st

2 Acceptance Criteria

Bland-Altman analysis assesses the agreement between two measurement methods, but to make that assessment you first need a benchmark against which to judge your results. This is termed an acceptance criterion (AC) and might be:

  • A maximum absolute value for the limits of agreement (LOAs)
  • A maximum absolute value for the bias

Usually, the Greek letter delta is used for both of these types of acceptance criteria. To avoid confusion, this page will use a lowercase delta (δ) to refer to an AC applied to the LOAs and a capital delta (Δ) to refer to an AC applied to the bias. This is by no means a standard distinction.

What counts as acceptable agreement is a clinical decision, not a statistical one, and it should be made a priori (i.e. you should decide on your ACs before performing the Bland-Altman analysis). Often a standard or a set of guidelines will specify how good the agreement must be; in other words, you will be given an acceptance criterion. Other times you might need clinical consultation to determine the minimal important difference, from which the smallest clinically relevant amount of disagreement (and thus your AC) can be deduced. Essentially, the differences between the measurements produced by the two methods you are investigating must be small enough to not be clinically important.
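As a rough illustration of how the two types of acceptance criteria could be applied, the sketch below checks a bias and a pair of LOAs against δ and Δ. The function name `meets_acceptance_criteria`, its parameter names and the use of "less than or equal to" are assumptions made for this example, and the numbers in the usage line are made up. It compares point estimates only and ignores the uncertainty in those estimates.

```python
def meets_acceptance_criteria(
    bias, loa_lower, loa_upper, delta_loa=None, delta_bias=None
):
    """Check point estimates against whichever criteria are supplied.

    delta_loa is the AC for the limits of agreement (lowercase delta on
    this page) and delta_bias is the AC for the bias (capital delta).
    Both names are hypothetical.
    """
    checks = []
    if delta_loa is not None:
        # Both LOAs must lie within +/- delta_loa
        checks.append(max(abs(loa_lower), abs(loa_upper)) <= delta_loa)
    if delta_bias is not None:
        # The bias must lie within +/- delta_bias
        checks.append(abs(bias) <= delta_bias)
    if not checks:
        raise ValueError('Supply at least one acceptance criterion')
    return all(checks)


# Made-up numbers, purely for illustration
print(meets_acceptance_criteria(
    bias=0.1, loa_lower=-0.5, loa_upper=0.7, delta_loa=1.0, delta_bias=0.2
))
# True
```

Making both criteria optional reflects the fact that a study might specify an AC for the LOAs, for the bias, or for both.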

3 Sample Size Calculations

Once the acceptance criteria are set, you will want an idea of whether your data will meet them before you commit to a full-blown study. Usually, a pilot test is done and its data is tested first. A paper by Zhou et al [1] on the agreement between two measurement methods documented exactly this: it included the results of a “pre-experiment” (pilot test) wherein the molar concentration of an antigen was measured in 24 subjects via two different methods:

# Pilot test data from Zhou et al (2011)
dct = {
    'Measurement 1 (mmol/L)': [
        0.023, 0.022, 0.025, 0.013, 0.008, 0.017, 0.026, 0.017,
        0.034, 0.007, 0.011, 0.006, 0.001, 0.006, 0.005, 0.007,
        0.011, 0.002, 0.037, 0.002, 0.003, 0.013, 0.008, 0.023,
    ],
    'Measurement 2 (mmol/L)': [
        0.021, 0.023, 0.024, 0.013, 0.007, 0.016, 0.022, 0.016,
        0.032, 0.006, 0.009, 0.004, 0.001, 0.007, 0.003, 0.005,
        0.009, 0.002, 0.036, 0.001, 0.001, 0.011, 0.007, 0.023,
    ],
}
df = pd.DataFrame(dct)

print(df.head())
##    Measurement 1 (mmol/L)  Measurement 2 (mmol/L)
## 0                   0.023                   0.021
## 1                   0.022                   0.023
## 2                   0.025                   0.024
## 3                   0.013                   0.013
## 4                   0.008                   0.007
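
Before turning to sample size, it can help to see the quantities that the acceptance criteria apply to. The following is a minimal sketch, not taken from the paper, that computes the bias and the 95% limits of agreement for the pilot data using the usual Bland-Altman definitions (mean difference plus or minus about 1.96 sample standard deviations). Taking the differences as Measurement 1 minus Measurement 2 is an assumption; the opposite direction only flips the signs. The data is repeated so that the block runs on its own.

```python
import pandas as pd
from scipy import stats as st

# Pilot test data from Zhou et al (2011), repeated from above
df = pd.DataFrame({
    'Measurement 1 (mmol/L)': [
        0.023, 0.022, 0.025, 0.013, 0.008, 0.017, 0.026, 0.017,
        0.034, 0.007, 0.011, 0.006, 0.001, 0.006, 0.005, 0.007,
        0.011, 0.002, 0.037, 0.002, 0.003, 0.013, 0.008, 0.023,
    ],
    'Measurement 2 (mmol/L)': [
        0.021, 0.023, 0.024, 0.013, 0.007, 0.016, 0.022, 0.016,
        0.032, 0.006, 0.009, 0.004, 0.001, 0.007, 0.003, 0.005,
        0.009, 0.002, 0.036, 0.001, 0.001, 0.011, 0.007, 0.023,
    ],
})

# Paired differences (direction is an assumption: method 1 minus method 2)
diffs = df['Measurement 1 (mmol/L)'] - df['Measurement 2 (mmol/L)']
n = len(diffs)

# Bias is the mean difference; use the sample standard deviation (ddof=1)
bias = diffs.mean()
sd = diffs.std(ddof=1)

# Two-sided 95% normal quantile (approximately 1.96)
z = st.norm.ppf(0.975)
loa_lower = bias - z * sd
loa_upper = bias + z * sd

print(f'n = {n}')
print(f'Bias = {bias:.6f} mmol/L')
print(f'SD of differences = {sd:.6f} mmol/L')
print(f'95% LOAs = {loa_lower:.6f} to {loa_upper:.6f} mmol/L')
```

These are point estimates from only 24 subjects, so they are best read as a preliminary indication and not as a verdict on agreement.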

The question then is: for a full study, what should the sample size be? How can we calculate this using only the pilot data?


  1. Zhou, YH, Zang, JJ, Wu, MJ, Xu, JF, He, J. “Allowable Total Error and Limits for Erroneous Results (ATE/LER) zones for agreement measurement”. Journal of Clinical Laboratory Analysis 2011; 25(2):83-89. DOI: 10.1002/jcla.20437. PMID: 21437998.

  2. Lu, MJ, Zhong, WH, Liu, YX, Miao, HZ, Li, YC, Ji, MH. “Sample size for assessing agreement between two methods of measurement by Bland-Altman method”. The International Journal of Biostatistics 2016; 12(2):20150039. DOI: 10.1515/ijb-2015-0039. PMID: 27838682.

  3. Schuirmann, D. “A comparison of the Two One-Sided Tests Procedure and the Power Approach for assessing the equivalence of average bioavailability”. Journal of Pharmacokinetics and Biopharmaceutics 1987; 15(6):657-680. DOI: 10.1007/BF01068419. PMID: 3450848.