Next, once the acceptance criteria are set, you will want to get an idea
of whether your data are likely to meet these criteria before you commit
to a full-blown study. Usually, a pilot test is done and that data is
analysed first. A paper published by Zhou et al on the agreement between
two measurement methods documented exactly this: it included the results
of a “pre-experiment” (pilot test) in which the molar concentration of an
antigen was measured in 24 subjects via two different methods:
For Studies Using ACs Related to the LOAs
For our example, an acceptance criterion of 0.004 mmol/L was
prescribed:
# Acceptance criterion
δ = 0.004 # mmol/L
This is the maximum absolute value that the two limits of agreement
can have. The test of agreement fails if the absolute value of one (or
both) LOA is larger than this, and it passes if both are less than
this.
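As a quick illustration of this pass/fail logic, here is a minimal
sketch; the LOA values used are the pilot estimates that will be
calculated below, and the variable names are purely illustrative:
# Sketch of the pass/fail check against the acceptance criterion
# (these LOA values are the pilot estimates calculated later in this post)
loa_lower, loa_upper = -0.001047, 0.003380
passes = (abs(loa_lower) < δ) and (abs(loa_upper) < δ)
print('Test of agreement passes' if passes else 'Test of agreement fails')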
Method 1: Bland’s Method
This method is loosely introduced on Bland’s
website but is presented more completely here.
Once you have pilot data, use it to calculate the pilot 95% limits of
agreement. These are your best estimates as to what the limits of
agreement for the full study will be. Then calculate the 99% confidence
intervals (CIs) around these pilot limits of agreement; we can be 99%
sure that the full limits of agreement will be within these intervals.
As such, if the extreme endpoints of these intervals - the upper
endpoint of the upper LOA’s CI and the lower endpoint of the lower LOA’s
CI - fall within your acceptance criteria we can be 99% sure that the
full study will pass. If this is insufficient certainty, you can
increase the confidence level when calculating the CIs - for example to
99.9% - but then it will be less likely that the endpoints of these CIs
will meet the acceptance criteria.
# Raw data
measurement1 = df['Measurement 1 (mmol/L)']
measurement2 = df['Measurement 2 (mmol/L)']
# Means
means = (measurement1 + measurement2) / 2
# Differences
diffs = measurement1 - measurement2
# Average difference (aka the bias)
bias = np.mean(diffs)
# Sample standard deviation
s = np.std(diffs, ddof=1) # Use ddof=1 to get the sample standard deviation
print(f'Bias = {round(bias, 6)}, standard deviation = {round(s, 6)}')
## Bias = 0.001167, standard deviation = 0.001129
The level of agreement - how far the limits of agreement sit from the
mean difference - will, as mentioned above, be set so that 95% of the
sample data fall within the LOAs:
# Level of agreement
agreement = 0.95 # 95%
# Limits of agreement (LOAs)
loas = st.norm.interval(agreement, bias, s)
print(f'The LOAs are {round(loas[0], 6)} and {round(loas[1], 6)}')
## The LOAs are -0.001047 and 0.00338
Now the confidence intervals are calculated. The level of
confidence - which determines the width of the confidence intervals -
will be taken as 99%, as discussed above, so that we can be 99% sure
that the LOAs of the full study will fall within them:
# Confidence level
confidence = 0.99 # 99%
# Significance level, α
alpha = 1 - confidence
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Sample size
n = len(df)
# Degrees of freedom
dof = n - 1
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
t_star = st.t.ppf(q, dof)
# Standard error of the LOAs
se_loas = np.sqrt(3 * s**2 / n)
# Endpoints of the confidence intervals
upper_ci_upper_loa = loas[1] + t_star * se_loas
lower_ci_lower_loa = loas[0] - t_star * se_loas
Now that we have the endpoints of the confidence intervals we’re
going to use the acceptance criterion to reverse-engineer the required
sample size. What we want is to find the sample size \(n\) such that the endpoints of the CIs just
match the acceptance criterion (ie the point at which the test is just
barely still passing):
\[ \delta = {LOA}_{upper} + t^* \times {se}_{LOAs} \]
\[ -\delta = {LOA}_{lower} - t^* \times {se}_{LOAs} \]
where:
- \(\delta\) is the acceptance
criterion (AC)
- \({LOA}_{upper}\) and \({LOA}_{lower}\) are the two limits of
agreement as calculated above
- \(t^*\) is the critical
t-statistic and \({se}_{LOAs}\) is the standard error of the
LOAs for a given sample size
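Substituting in the expression for the standard error that is used in
the code above, the first of these conditions can be written out in
terms of \(n\):
\[ \delta = {LOA}_{upper} + t^*_{n-1} \times \sqrt{\dfrac{3 s^2}{n}} \]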
This is a little tricky to calculate because both \(t^*\) and \({se}_{LOAs}\) depend on \(n\). We could use an optimisation algorithm
to find a solution for it but, instead, we will just loop through
different potential values, check each one and stop when we reach the
point where the AC is met:
# Loop through different sample sizes
# Start at 2, because if we start at 1 the number of degrees of freedom is 0
for n in np.arange(2, 1000):
    # Degrees of freedom
    dof = n - 1
    # Critical t-statistic
    t_star = st.t.ppf(q, dof)
    # Standard error of the LOAs
    se_loas = np.sqrt(3 * s**2 / n)
    # Endpoints of the confidence intervals
    upper_ci_upper_loa = loas[1] + t_star * se_loas
    lower_ci_lower_loa = loas[0] - t_star * se_loas
    # Test if the acceptance criterion is met
    if upper_ci_upper_loa < δ:
        if lower_ci_lower_loa > -δ:
            break
print(
    f'A sample size of {n} gives endpoints of the 99% CIs at',
    f'{round(lower_ci_lower_loa, 6)} and {round(upper_ci_upper_loa, 6)}'
)
## A sample size of 70 gives endpoints of the 99% CIs at -0.001666 and 0.003999
Both -0.001666 and 0.003999 are less than the AC (0.004) in
magnitude, so we can be 99% sure that the full study will pass if we use
this sample size (if the pilot study was
representative).
Remember, \(n\) is the number of
pairs of data points you will need in the full study. So, in
this case, you will need 70 participants each measured twice (once with
each method).
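For convenience, the whole of Method 1 can be wrapped up into a helper
function. This is only a sketch - the function name and signature are
illustrative, not part of any library - and it assumes that numpy and
scipy.stats are available as np and st as elsewhere on this page:
# A sketch of Method 1 as a reusable function (the name and signature are
# illustrative only)
def bland_sample_size(bias, s, delta, agreement=0.95, confidence=0.99, n_max=1000):
    # Pilot limits of agreement
    loas = st.norm.interval(agreement, bias, s)
    # Quantile (the cumulative probability) for the confidence intervals
    q = 1 - (1 - confidence) / 2
    # Search for the smallest sample size whose CI endpoints meet the criterion
    for n in np.arange(2, n_max):
        t_star = st.t.ppf(q, n - 1)
        se_loas = np.sqrt(3 * s**2 / n)
        if loas[1] + t_star * se_loas < delta and loas[0] - t_star * se_loas > -delta:
            return n
    return None

# Using the pilot bias and standard deviation from above, this should again
# return a sample size of 70
bland_sample_size(0.001167, 0.001129, 0.004)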
Method 2: Lu’s Method
This method is not fundamentally different from Bland’s method; it
just calculates the power (1 minus the probability of making a Type II
error) explicitly and thus allows this to be set by the experimenter.
It is detailed in Lu et al’s paper and uses the same example data and
acceptance criterion as before:
# Raw data
measurement1 = df['Measurement 1 (mmol/L)']
measurement2 = df['Measurement 2 (mmol/L)']
# Means
means = (measurement1 + measurement2) / 2
# Differences
diffs = measurement1 - measurement2
# Average difference (aka the bias)
bias = np.mean(diffs)
# Sample standard deviation
s = np.std(diffs, ddof=1) # Use ddof=1 to get the sample standard deviation
# Acceptance criterion
δ = 0.004 # mmol/L
We could use a loop to try out different values for the sample size
like we did last time, but the code will be cleaner if we just do the
calculations for many different sample sizes from the start:
# Sample sizes to check
n = np.arange(2, 1000)
The limits of agreement and confidence intervals are calculated in a
similar fashion to before, although note that the two different
significance levels are now given the names \(\gamma\) and \(\alpha\) (and their corresponding critical
values are \(z_\gamma\) and \(t_\alpha\)) so that we can tell them apart.
We are also using 95% for the confidence intervals, not 99%.
#
# Limits of agreement
#
# Agreement level for the LOAs
agreement = 0.95 # 95% limits of agreement
# Significance level for the LOAs
gamma = 1 - agreement
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (gamma / tails)
# Critical z-score, calculated using the percent-point function (aka the
# quantile function) of the normal distribution
zgamma = st.norm.ppf(q) # 1.96
# Standard error of the LOAs
se = s * np.sqrt(1 / n + zgamma**2 / (2 * (n - 1)))
#
# Confidence intervals
#
# Confidence level for the confidence intervals
confidence = 0.95 # 95%
# Significance level for the confidence intervals
alpha = 1 - confidence
# Number of tails
tails = 2
# Quantile (the cumulative probability)
q = 1 - (alpha / tails)
# Degrees of freedom
dof = n - 1
# Critical t-statistic, calculated using the percent-point function (aka the
# quantile function) of the t-distribution
talpha = st.t.ppf(q, dof)
Here comes the new stuff: Lu et al calculate the probabilities of
Type II errors (\(\beta_1\) and \(\beta_2\)) explicitly, assuming a
non-central t-distribution with non-centrality parameters \(\tau_1\) and \(\tau_2\). The function that does this in
Python is nct.cdf(), the cumulative distribution function of the
non-central t-distribution:
#
# Estimate type-II error using non-central t-distribution
#
# Non-centrality parameters
tau1 = (δ - bias - zgamma * s) / se
tau2 = (δ + bias - zgamma * s) / se
# Cumulative distribution function for a non-central Student's t continuous
# random variable
beta1 = st.nct.cdf(talpha, dof, tau1) # Lu (2016) equation 3
beta2 = st.nct.cdf(talpha, dof, tau2) # Lu (2016) equation 4
The statistical power can now be estimated after combining the two
Type II error probabilities:
# Power
power = 1 - (beta1 + beta2) # Lu (2016) equation 5
This has given us a value for the power for every one of the
candidate sample sizes we initialised. We are only interested in the
ones that are large enough to make our test robust; if we use a cutoff
of 0.80 we can find the smallest value of \(n\) at which the power first
reaches this cutoff:
# Use a cutoff of 0.80 for the power (power = 1 - β)
desired_power = 0.80
# Find where the power reaches this level for the first time
i, p = next((i, v) for i, v in enumerate(power) if v >= desired_power)
# Find the corresponding value for the sample size
n = n[i]
print(f'A sample size of {n} will ensure that the test has a statistical power of {round(p, 3)}')
## A sample size of 79 will ensure that the test has a statistical power of 0.802
Is this right?
We have calculated our required \(n\) to be 79. However, in Section 5 of Lu
et al (2016) they reported getting an \(n\) of 83 for the exact same example!
Moreover, one of the MedCalc
pages also gets an answer of 83 using this data and method! The reason
for this seems to be that both Lu and MedCalc used an approximation of
the full method when doing this example: specifically, they probably
used Equation 6 from the 2016 paper. When using the PASS (Power Analysis
& Sample Size) programme from NCSS - which is often considered a gold
standard for sample size calculations and which uses Equation 5 from the
2016 paper - the answer is confirmed as being 79.
Thanks to Aaron Caldwell for advising on the above.
Method 3: blandPowerCurve() from SimplyAgree in R
Importing the SimplyAgree library in R gives us access to the
blandPowerCurve()
function (see the documentation
and the source
code). Let’s use that to run the same example from Section 5 of Lu
et al:
# https://aaroncaldwell.us/SimplyAgree/
# install.packages("SimplyAgree", repos = "http://cran.us.r-project.org")
library(SimplyAgree)
powerCurve <- blandPowerCurve(
    samplesizes = seq(2, 200, 1),
    mu = 0.001167,
    SD = 0.001129,
    delta = 0.004,
    conf.level = 0.95,
    agree.level = 0.95
)
found_n <- find_n(powerCurve, power = 0.80)
n <- found_n[["N"]]
power <- found_n[["power"]]
print(c(n, power))
## [1] 79.0000000 0.8022956
We get \(n = 79\) and \(p = 0.802\), so we agree with the gold
standard solution.
Aside: Calculating Bias and Standard Deviation from LOAs
Sometimes you’ll find a data source that reports the LOAs but not the
mean difference, \(\bar d\) (the bias)
or the sample standard deviation, \(s\). That’s not a problem, though, because
you can work these out by considering the definitions of the LOAs:
\[ {LOA}_{upper} = \bar d + 1.96 \times s,\: {LOA}_{lower} = \bar d - 1.96 \times s \]
\[ \implies \bar d = {LOA}_{upper} - 1.96 \times s = {LOA}_{lower} + 1.96 \times s \]
\[ \implies s = \dfrac{ {LOA}_{upper} - {LOA}_{lower} }{2 \times 1.96} \]
Here it is in Python:
# Limits of agreement
lower_loa = -0.0010467586944627883
upper_loa = 0.0033800920277961216
# Sample standard deviation
s = (upper_loa - lower_loa) / (2 * 1.96)
# Mean difference
bias = upper_loa - 1.96 * s
print(f'Bias = {round(bias, 6)}, standard deviation = {round(s, 6)}')
## Bias = 0.001167, standard deviation = 0.001129
These values match those given in Section 5 of Lu et al.
For Studies Using ACs Related to the Bias
A good test for an acceptably small bias is to see if it is
equivalent to zero. For this, the TOST (two one-sided t-tests)
procedure can be used (see Schuirmann (1987), the Wikipedia page
and this
tutorial). Specifically, we want to use the one-sample TOST because
that will compare the one group of data points we have with the zero
line (the fact that our data points are the differences between repeated
measurements is irrelevant - using the two-sample or paired-sample TOST
would be incorrect for this application).
For the acceptance criterion, let’s use 10% of the AC that was applied
to the LOAs. In other words, the region of equivalence in which the bias
must lie in order for us to conclude that it is equivalent to zero is
one-tenth as wide as the region used for the LOAs.
# Acceptance criterion
Δ = 0.0004 # mmol/L
We again calculate the average and the sample standard deviation of
the differences between the pairs of raw data points:
# Raw data
measurement1 = df['Measurement 1 (mmol/L)']
measurement2 = df['Measurement 2 (mmol/L)']
# Means
means = (measurement1 + measurement2) / 2
# Differences
diffs = measurement1 - measurement2
# Average difference (aka the bias)
bias = np.mean(diffs)
# Sample standard deviation
s = np.std(diffs, ddof=1) # Use ddof=1 to get the sample standard deviation
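Incidentally, the TOST itself (as opposed to its power) can be run
directly on these pilot differences. Here is a minimal sketch using two
one-sided one-sample t-tests via SciPy’s ttest_1samp; it is just one
possible way of doing it, not code taken from the TOSTER package:
# Minimal sketch of the one-sample TOST on the pilot differences
# Upper one-sided test: the null hypothesis is that the bias is >= +Δ
_, p_upper = st.ttest_1samp(diffs, Δ, alternative='less')
# Lower one-sided test: the null hypothesis is that the bias is <= -Δ
_, p_lower = st.ttest_1samp(diffs, -Δ, alternative='greater')
# Equivalence is concluded if the larger of the two one-sided p-values is
# below the significance level
p_tost = max(p_upper, p_lower)
print(f'TOST p-value: {round(p_tost, 3)}')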
Now, while there is currently no function in Python to calculate the
power of the TOST procedure in the way that we want, there is a
function in R: power_t_TOST()
from the TOSTER library (see
the documentation here
and here
and an example here).
Looking at the source
code allows us to ‘translate’ some of it into Python:
#
# power_t_TOST.R
# Calculates the exact power of two one-sided t-tests (TOST) for one, two, and
# paired samples.
#
# Assumed true difference
delta = 0
# Equivalence bound
high_eqbound = Δ
low_eqbound = -Δ
# Significance level, α
alpha = 0.05
# Desired statistical power, 1 - β
desired_power = 0.80
# Iterate through many potential values for the sample size
for n in np.arange(2, 100):
    # Degrees of freedom
    dof = n - 1
    nc = np.sum(1 / n)
    se_fac = np.sqrt(1 * nc)
    # Standard error of the mean
    sem = s * se_fac
    # Non-centrality parameters
    delta1 = (delta - low_eqbound) / sem
    delta2 = (delta - high_eqbound) / sem
    # Cumulative distribution function for a non-central Student's t continuous
    # random variable
    p1 = st.nct.cdf(0, dof, delta1)
    p2 = st.nct.cdf(0, dof, delta2)
    # Power
    power = p2 - p1
    if power > desired_power:
        break
print(f'A full study will need {n} participants to have a power of {round(power, 3)}')
## A full study will need 14 participants to have a power of 0.815
Remember that \(n\) is the number of
pairs of data points, ie 28 total measurements will need to be
made.