When you have one set of numbers, their coefficient of variation (\(c_v\)) is equal to:
\(c_v = \dfrac{\sigma}{\mu}\)
where \(\sigma\) is the (population) standard deviation and \(\mu\) is the mean of the data.
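When working with a sample rather than the whole population, the sample statistics are used instead:
\(\hat{c}_v = \dfrac{s}{\bar{x}}\)
where \(s\) is the sample standard deviation and \(\bar{x}\) is the sample mean. The worked example below calculates both versions.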
Here are worked examples that come from the Wikipedia page about the coefficient of variation:
import numpy as np
data = [1, 5, 6, 8, 10, 40, 65, 88]
# Mean
μ = np.mean(data)
# Sample standard deviation
s = np.std(data, ddof=1)
# Coefficient of variation
cv = s / μ
print(f'cᵥ = {cv:4.2f}')
## cᵥ = 1.18
# Population standard deviation
σ = np.std(data, ddof=0)
# Coefficient of variation
cv = σ / μ
print(f'cᵥ = {cv:4.2f}')
## cᵥ = 1.10
These match the answers of 1.18 and 1.10 given on Wikipedia. Here’s another worked example that comes from the Influential Points page:
data = [280, 295, 245, 310, 285]
# Coefficient of variation
cv = np.std(data, ddof=1) / np.mean(data)
print(f'cᵥ = {cv:6.4f}')
## cᵥ = 0.0853
This matches their answer of 0.0853 (or 8.53% in percentage form).
When you have two (or more) sets of paired numbers, the above formula is too simplistic: do you use the standard deviation of all of the numbers together, or the mean of the standard deviations of each of the sets? Or do you work it out for each pair of measurements and average that? There are two approaches to answering this (and they give almost identical results, so either can be used; see here): the root-mean-square method and the logarithmic method. A worked example of each will be done below, and both will use the following fake data (which comes from the same Influential Points page):
import pandas as pd
# Raw data
dct = {
'first_measurement': [33, 26, 29, 32, 31, 31, 31, 31, 35, 21],
'second_measurement': [35, 23, 29, 35, 28, 31, 34, 29, 33, 24],
}
df = pd.DataFrame(dct)
print(df)
## first_measurement second_measurement
## 0 33 35
## 1 26 23
## 2 29 29
## 3 32 35
## 4 31 28
## 5 31 31
## 6 31 34
## 7 31 29
## 8 35 33
## 9 21 24
This example data represents the results of an experiment where two repeated measurements were taken from 10 participants or samples (so 20 measurements in total), hence 10 pairs of numbers.
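For reference, the root-mean-square method implemented below amounts to the following: for each pair the within-subject variance is estimated as half the squared difference between the two measurements (\(s_i^2 = d_i^2 / 2\)), each pair's coefficient of variation is \(s_i / \bar{x}_i\) where \(\bar{x}_i\) is that pair's mean, and these are combined by taking the square root of the mean of their squares:
\(\text{WCV} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} \dfrac{s_i^2}{\bar{x}_i^2}}\)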
Here is an implementation of the root-mean-square method as described on the Influential Points page:
means = df.mean(axis=1)
diffs = df.diff(axis=1).iloc[:, -1]
# Within-subject variances
s2 = diffs**2 / 2
# Individual CVs
cv = np.sqrt(s2) / means
# Within-subject coefficient of variation
wcv = np.sqrt(np.mean(cv**2))
print(f'WCV = {wcv:6.4f}')
## WCV = 0.0596
This matches the correct answer, which is 5.96%.
Martin Bland (of Bland-Altman fame) gives a slightly differently worded description of the root-mean-square method on his page, but it is mathematically equivalent to the above and so gives the same answer:
means = df.mean(axis=1, skipna=False)
diffs = df.diff(axis=1).iloc[:, -1]
# Within-subject variances
s2 = diffs**2 / 2
# Within-subject coefficient of variation
s2m2 = s2 / means**2
wcv = np.sqrt(np.mean(s2m2))
print(f'WCV = {wcv:6.4f}')
## WCV = 0.0596
Bland goes on to explain how to get the confidence interval around this result:
# Standard error
se = s2m2.std() / np.sqrt(len(df))
# 95% confidence interval for the within-subject CV
ci = (
np.sqrt(s2m2.mean() - 1.96 * se),
np.sqrt(s2m2.mean() + 1.96 * se)
)
print(f'WCV = {wcv:6.4f} (95% CI [{ci[0]:6.4f}, {ci[1]:6.4f}])')
## WCV = 0.0596 (95% CI [0.0411, 0.0736])
Out of interest, the “1.96” in the calculation above is the number of standard deviations around the mean within which 95% of values drawn from a normal distribution lie. Here’s how to calculate it exactly:
import scipy.stats as stats
# Significance level
alpha = 0.05 # 95% confidence level
# One- or two-tailed test?
tails = 2
# Cumulative probability
q = 1 - (alpha / tails)
# Percent-point function (aka quantile function) of the normal distribution
z_critical = stats.norm.ppf(q)
print(z_critical)
## 1.959963984540054
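As a quick sanity check, plugging this exact critical value into the root-mean-square confidence interval calculated above (re-using the s2m2 and se variables from that code) gives the same interval to four decimal places:
# Re-calculate the confidence interval with the exact critical value
ci = (
    np.sqrt(s2m2.mean() - z_critical * se),
    np.sqrt(s2m2.mean() + z_critical * se)
)
print(f'95% CI [{ci[0]:6.4f}, {ci[1]:6.4f}]')
## 95% CI [0.0411, 0.0736]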
Still on Bland’s page but using the data from the Influential Points page (because Bland used randomly-generated data that cannot be replicated), here’s how the logarithmic method can be implemented:
# Log-transform the data
df_log = np.log(df)
means_log = df_log.mean(axis=1, skipna=False)
diffs_log = df_log.diff(axis=1).iloc[:, -1]
# Within-subject variances
s2_log = diffs_log**2 / 2
# Within-subject standard deviation
ws_log = np.sqrt(s2_log.mean())
# Within-subject coefficient of variation
wcv = np.exp(ws_log) - 1
print(f'WCV = {wcv:6.4f}')
## WCV = 0.0615
…and the confidence interval around this answer is:
# Number of subjects
n = df_log.shape[0]
# Number of measurements per subject
m = df_log.shape[1]
# Standard error
se = ws_log / np.sqrt(2 * n * (m - 1))
# 95% confidence interval for the within-subject CV
ci = (np.exp(ws_log - 1.96 * se) - 1, np.exp(ws_log + 1.96 * se) - 1)
print(f'WCV = {wcv:6.4f} (95% CI [{ci[0]:6.4f}, {ci[1]:6.4f}])')
## WCV = 0.0615 (95% CI [0.0341, 0.0896])
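Written out, the logarithmic method amounts to calculating the within-subject standard deviation of the log-transformed data and back-transforming it:
\(s_w = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} \dfrac{d_i^2}{2}}\)
where \(d_i\) is the difference between a subject's two log-transformed measurements, then:
\(\text{WCV} = e^{s_w} - 1\)
with the 95% confidence interval being \(\left(e^{s_w - 1.96 \times \text{SE}} - 1,\ e^{s_w + 1.96 \times \text{SE}} - 1\right)\) where \(\text{SE} = s_w / \sqrt{2n(m - 1)}\), \(n\) being the number of subjects and \(m\) the number of measurements per subject.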
As noted by Bland, the results given by the two methods are similar (5.96% and 6.15% in our example) and so either can be used.
The Influential Points page goes on to calculate the between-subject coefficient of variation:
# Between-subject coefficient of variation
var_means = np.var(means, ddof=1)
mean_means = np.mean(means)
# Number of measurements per subject
m = df.shape[1]
bcv = np.sqrt(m * var_means) / mean_means
print(f'BCV = {bcv:5.3f}')
## BCV = 0.185
This matches the correct answer of 18.5%.
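Written as a formula, the calculation above is:
\(\text{BCV} = \dfrac{\sqrt{m \times s_{\bar{x}}^2}}{\bar{x}_{\text{grand}}}\)
where \(s_{\bar{x}}^2\) is the sample variance of the subject means, \(\bar{x}_{\text{grand}}\) is the grand mean (the mean of the subject means) and \(m\) is the number of measurements per subject.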