Cohen’s kappa is a measure of how much two judges agree with each other when they are rating things qualitatively.
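For reference, if $p_o$ is the observed proportion of agreement and $p_e$ is the proportion of agreement expected by chance alone, the statistic is defined as:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

This is the formula used in all of the worked examples below.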
Cohen’s kappa can be quickly found using the scikit-learn Python package. This can be installed from the terminal with:
python3.11 -m pip install scikit-learn
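Once installed, you can check that it imports correctly. Note that, although the package is installed as “scikit-learn”, it is imported under the name sklearn; the following one-liner should print your installed version number without raising an error:

python3.11 -c "import sklearn; print(sklearn.__version__)"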
Here are two examples that demonstrate how to use the cohen_kappa_score() function from this package:
The simple example given on the Wikipedia page involves two people each reading 50 proposals and saying either “Yes” or “No” to each. The data looks as follows:
| | | Reader B | |
|---|---|---|---|
| | | Yes | No |
| Reader A | Yes | 20 | 5 |
| | No | 10 | 15 |
This type of table, where the agreement between two raters or tests is shown, is known as a confusion matrix.
Using this information, we can manually reverse-engineer the raw data: the response of each reader to each proposal must be as follows (where Y stands for “yes” and N stands for “no”):
readerA = [
'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
]
readerB = [
'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
'N', 'N', 'N', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'Y',
'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
]
OK, we haven’t fully reverse-engineered the raw data because we don’t know specifically which proposals each reader accepted and rejected, but we have determined a generalised version of their responses, which is all we need.
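In fact, because only the counts matter, the same lists could equally be built directly from the four cells of the confusion matrix (a compact alternative to writing the responses out in full):

# Build the raw data from the cell counts: 20 Y/Y, 5 Y/N, 10 N/Y and 15 N/N pairs
readerA = ['Y'] * 25 + ['N'] * 25
readerB = ['Y'] * 20 + ['N'] * 5 + ['Y'] * 10 + ['N'] * 15

These produce exactly the same lists as the ones written out above.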
If you want to calculate Cohen’s kappa directly from the confusion matrix (ie without reverse-engineering the raw data) then see the ‘deeper dive’ worked examples below.
The Python package to use is sklearn.metrics, so import that into your script:
import sklearn.metrics
The first function of use is confusion_matrix() which, as the name suggests, re-creates the confusion matrix that we reverse-engineered to get the raw data:
cm = sklearn.metrics.confusion_matrix(readerA, readerB)
print(cm)
## [[15 10]
## [ 5 20]]
This matrix has the same numbers as the one we were given but in a different configuration. Fortunately, we can manually re-order the labels to fix this, using the labels keyword argument:
cm = sklearn.metrics.confusion_matrix(readerA, readerB, labels=['Y', 'N'])
print(cm)
## [[20 5]
## [10 15]]
This is now the same as what we started with, which is confirmation that we correctly reverse-engineered the raw data.
Now the Cohen’s kappa can be calculated, using the cohen_kappa_score() function:
kappa = sklearn.metrics.cohen_kappa_score(readerA, readerB)
print(kappa)
## 0.4
This value of 0.4 is the same as the solution on the original page.
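As an aside, cohen_kappa_score() also accepts a labels keyword argument, although for the unweighted score the ordering of the labels makes no difference to the result:

kappa = sklearn.metrics.cohen_kappa_score(readerA, readerB, labels=['Y', 'N'])
print(kappa)
## 0.4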
Another worked example can be found on the Real Statistics site. This one is more complicated because the two judges are assigning observations into three groups as opposed to the previous example which only had two groups (“yes” and “no”). Specifically, this example involves two psychologists who are diagnosing patients as one of “psychotic”, “borderline” or “neither”:
| | | Psychologist 1 | | |
|---|---|---|---|---|
| | | Psychotic | Borderline | Neither |
| Psychologist 2 | Psychotic | 10 | 4 | 1 |
| | Borderline | 6 | 16 | 2 |
| | Neither | 0 | 3 | 8 |
Again, we can manually reverse-engineer the raw data by looking at the above confusion matrix. Create a list for each psychologist’s diagnoses and fill them in with 'P' for “psychotic”, 'B' for “borderline” and 'N' for “neither”:
psychologist1 = [
'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P',
'P', 'P', 'P', 'P', 'P', 'P', 'B', 'B', 'B', 'B',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
]
psychologist2 = [
'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
'B', 'B', 'P', 'P', 'P', 'P', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'P', 'B', 'B',
]
…and re-produce the confusion matrix:
cm = sklearn.metrics.confusion_matrix(psychologist2, psychologist1, labels=['P', 'B', 'N'])
print(cm)
## [[10 4 1]
## [ 6 16 2]
## [ 0 3 8]]
…and calculate the Cohen’s kappa:
kappa = sklearn.metrics.cohen_kappa_score(psychologist1, psychologist2)
print(kappa)
## 0.49590422180214233
This agrees with the value given on the original page.
If you want to calculate Cohen’s kappa directly from the confusion matrix (ie without reverse-engineering the raw data) then take a look at the ‘deeper dive’ worked examples which follow:
While the cohen_kappa_score() function produces the correct result, it hides how the calculation is performed. This is normally fine, but sometimes we need to use the intermediate values produced by the calculation steps, for example if we want to calculate confidence intervals. So it’s worthwhile being able to perform the calculation in ‘long-hand’.
Both the Wikipedia page and the scikit-learn source code detail how to calculate Cohen’s kappa using essentially the same method. We start with the confusion matrix as created in the previous example:
cm = sklearn.metrics.confusion_matrix(readerA, readerB, labels=['Y', 'N'])
print(cm)
## [[20 5]
## [10 15]]
We will need to use the NumPy package for this, so remember to import that:
import numpy as np
Now, we create the ‘expected matrix’ (the counts of each type of agreement and disagreement between these readers that you would expect to see due to chance alone):
# Sample size
n = np.sum(cm)
# Expected matrix
sum0 = np.sum(cm, axis=0)
sum1 = np.sum(cm, axis=1)
expected = np.outer(sum0, sum1) / n
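For this data the expected matrix looks as follows (these values follow directly from the row and column totals, so printing it is just a quick sanity check):

print(expected)
## [[15. 15.]
##  [10. 10.]]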
The Wikipedia page then carries on the calculation as follows:
# Number of classes
n_classes = cm.shape[0]
# Calculate p_o (the observed proportionate agreement) and
# p_e (the probability of random agreement)
identity = np.identity(n_classes)
p_o = np.sum((identity * cm) / n)
p_e = np.sum((identity * expected) / n)
# Calculate Cohen's kappa
kappa = (p_o - p_e) / (1 - p_e)
print(f'p_o = {p_o}, p_e = {p_e}, kappa = {kappa:3.1f}')
## p_o = 0.7, p_e = 0.5, kappa = 0.4
These results match those on the Wikipedia page itself.
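Incidentally, multiplying by the identity matrix and summing is just a way of adding up the diagonal, so np.trace() can be used as a more compact equivalent:

# Equivalent calculation using the trace (the sum of the diagonal)
p_o = np.trace(cm) / n
p_e = np.trace(expected) / n
print(f'p_o = {p_o}, p_e = {p_e}')
## p_o = 0.7, p_e = 0.5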
The method used by the cohen_kappa_score() function (as seen from its source code) looks a little different because it includes the option of using weights to alter how the score is calculated. Here’s how it looks when the weights are ‘none’ (ie when it’s used equivalently to the Wikipedia method):
# Number of classes
n_classes = cm.shape[0]
# The following assumes you are not using weights
w_mat = np.ones([n_classes, n_classes], dtype=int)
w_mat.flat[:: n_classes + 1] = 0
k = np.sum(w_mat * cm) / np.sum(w_mat * expected)
# Calculate Cohen's kappa
kappa = 1 - k
print(f'kappa = {kappa}')
## kappa = 0.4
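As an aside, when weights are used the only thing that changes is the weight matrix. The sketch below (loosely based on the scikit-learn source, not a verbatim copy of it) shows ‘linear’ weighting, where disagreements are penalised by the absolute difference between class indices; with only two classes this gives the same weight matrix as above, so the score is unchanged:

# Linear weights: the penalty equals the absolute difference between class indices
indices = np.arange(n_classes)
w_mat = np.abs(indices[:, None] - indices[None, :])
# For two classes this is [[0, 1], [1, 0]], identical to the unweighted case
kappa = 1 - np.sum(w_mat * cm) / np.sum(w_mat * expected)
print(kappa)
## 0.4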
Unsurprisingly, the method used on the Real Statistics page is the same as that on Wikipedia and in the scikit-learn source code. However, they go through the process a bit more slowly, so it’s worthwhile re-creating their steps here. First, re-create the confusion matrix:
cm = sklearn.metrics.confusion_matrix(psychologist2, psychologist1, labels=['P', 'B', 'N'])
print(cm)
## [[10 4 1]
## [ 6 16 2]
## [ 0 3 8]]
Calculate the agreement, the agreement by chance and the sample size from the confusion matrix:
# Sample size
n = np.sum(cm)
# Number of classes
n_classes = cm.shape[0]
# Agreement
agreement = 0
for i in np.arange(n_classes):
    # Sum the diagonal values
    agreement += cm[i, i]
# Agreement due to chance
judge1_totals = np.sum(cm, axis=0)
judge2_totals = np.sum(cm, axis=1)
judge1_totals_prop = np.sum(cm, axis=0) / n
judge2_totals_prop = np.sum(cm, axis=1) / n
by_chance = np.sum(judge1_totals_prop * judge2_totals_prop * n)
# Calculate Cohen's kappa
kappa = (agreement - by_chance) / (n - by_chance)
print(f'agreement = {agreement}, by_chance = {by_chance:.3f}, kappa = {kappa:.3f}')
## agreement = 34, by_chance = 18.260, kappa = 0.496
These results are the same as on the original page.
Adding a confidence interval to the Cohen’s kappa value is slightly more complicated because the scikit-learn code does not do this for you and the Wikipedia and Real Statistics methods differ slightly.
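For reference, the approximate standard error used in the Wikipedia approach is (with $N$ being the sample size and $p_o$, $p_e$ as defined above):

$$SE_\kappa = \sqrt{\frac{p_o (1 - p_o)}{N (1 - p_e)^2}}$$

and a 95% confidence interval is then $\kappa \pm 1.96 \times SE_\kappa$.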
This is explained on the Wikipedia page and can be coded up as follows:
# Standard error of Cohen's kappa
se = np.sqrt((p_o * (1 - p_o)) / (n * (1 - p_e)**2))
# Width of the 95% confidence interval
ci = 1.96 * se * 2
# Bounds of the 95% confidence interval
lower = kappa - 1.96 * se
upper = kappa + 1.96 * se
print(f'standard error = {se}')
print(f'lower confidence interval = {lower}')
print(f'upper confidence interval = {upper}')
Using this formula on the data from the Wikipedia example gives:
# Confusion matrix
cm = sklearn.metrics.confusion_matrix(readerA, readerB, labels=['Y', 'N'])
# Sample size
n = np.sum(cm)
# Expected matrix
sum0 = np.sum(cm, axis=0)
sum1 = np.sum(cm, axis=1)
expected = np.outer(sum0, sum1) / n
# Number of classes
n_classes = cm.shape[0]
# Calculate p_o (the observed proportionate agreement) and
# p_e (the probability of random agreement)
identity = np.identity(n_classes)
p_o = np.sum((identity * cm) / n)
p_e = np.sum((identity * expected) / n)
# Calculate Cohen's kappa
kappa = (p_o - p_e) / (1 - p_e)
# Confidence intervals
se = np.sqrt((p_o * (1 - p_o)) / (n * (1 - p_e)**2))
ci = 1.96 * se * 2
lower = kappa - 1.96 * se
upper = kappa + 1.96 * se
print(
    f'p_o = {p_o}, p_e = {p_e}, kappa = {kappa:.1f}\n',
    f'standard error = {se:.3f}\n',
    f'lower confidence interval = {lower:.3f}\n',
    f'upper confidence interval = {upper:.3f}', sep=''
)
## p_o = 0.7, p_e = 0.5, kappa = 0.4
## standard error = 0.130
## lower confidence interval = 0.146
## upper confidence interval = 0.654
And using this formula on the data from the Real Statistics example gives:
## p_o = 0.68, p_e = 0.365, kappa = 0.496
## standard error = 0.104
## lower confidence interval = 0.292
## upper confidence interval = 0.700
Note that this confidence interval of 0.292 to 0.700 is slightly different to the one given on the original page (which was 0.288 to 0.704). However, the fact that they agree to two decimal places suggests that the two methods are broadly equivalent.
The following alternative method comes from the Real Statistics page:
import scipy.stats

# This assumes that cm, n and kappa from the previous example are still defined
# Expected matrix
sum0 = np.sum(cm, axis=0)
sum1 = np.sum(cm, axis=1)
expected = np.outer(sum0, sum1) / n
# Number of classes
n_classes = cm.shape[0]
# Calculate p_o (the observed proportionate agreement) and
# p_e (the probability of random agreement)
identity = np.identity(n_classes)
p_o = np.sum((identity * cm) / n)
p_e = np.sum((identity * expected) / n)
# Calculate a
ones = np.ones([n_classes, n_classes])
row_sums = np.inner(cm, ones)
col_sums = np.inner(cm.T, ones).T
sums = row_sums + col_sums
a_mat = cm / n * (1 - sums / n * (1 - kappa))**2
a = np.sum(identity * a_mat)
# Calculate b
b_mat = cm / n * (sums / n)**2
b_mat = b_mat * (ones - identity)
b = (1 - kappa)**2 * np.sum(b_mat)
# Calculate c
c = (kappa - p_e * (1 - kappa))**2
# Standard error
se = np.sqrt((a + b - c) / n) / (1 - p_e)
# Two-tailed statistical test
alpha = 0.05
z_crit = scipy.stats.norm.ppf(1 - alpha / 2)
ci = se * z_crit * 2
lower = kappa - se * z_crit
upper = kappa + se * z_crit
print(f'a = {a}, b = {b}, c = {c}')
print(f'standard error = {se}')
print(f'lower confidence interval = {lower}')
print(f'upper confidence interval = {upper}')
Using this method on the data from the Wikipedia example gives:
import scipy.stats
# Confusion matrix
cm = sklearn.metrics.confusion_matrix(readerA, readerB, labels=['Y', 'N'])
#
# Cohen's kappa
#
# Sample size
n = np.sum(cm)
# Number of classes
n_classes = cm.shape[0]
# Agreement
agreement = 0
for i in np.arange(n_classes):
    # Sum the diagonal values
    agreement += cm[i, i]
# Agreement due to chance
judge1_totals = np.sum(cm, axis=0)
judge2_totals = np.sum(cm, axis=1)
judge1_totals_prop = np.sum(cm, axis=0) / n
judge2_totals_prop = np.sum(cm, axis=1) / n
by_chance = np.sum(judge1_totals_prop * judge2_totals_prop * n)
# Calculate Cohen's kappa
kappa = (agreement - by_chance) / (n - by_chance)
#
# Confidence interval
#
# Expected matrix
sum0 = np.sum(cm, axis=0)
sum1 = np.sum(cm, axis=1)
expected = np.outer(sum0, sum1) / n
# Number of classes
n_classes = cm.shape[0]
# Calculate p_o (the observed proportionate agreement) and
# p_e (the probability of random agreement)
identity = np.identity(n_classes)
p_o = np.sum((identity * cm) / n)
p_e = np.sum((identity * expected) / n)
# Calculate a
ones = np.ones([n_classes, n_classes])
row_sums = np.inner(cm, ones)
col_sums = np.inner(cm.T, ones).T
sums = row_sums + col_sums
a_mat = cm / n * (1 - sums / n * (1 - kappa))**2
identity = np.identity(n_classes)
a = np.sum(identity * a_mat)
# Calculate b
b_mat = cm / n * (sums / n)**2
b_mat = b_mat * (ones - identity)
b = (1 - kappa)**2 * np.sum(b_mat)
# Calculate c
c = (kappa - p_e * (1 - kappa))**2
# Standard error
se = np.sqrt((a + b - c) / n) / (1 - p_e)
# Two-tailed statistical test
alpha = 0.05
z_crit = scipy.stats.norm.ppf(1 - alpha / 2)
ci = se * z_crit * 2
lower = kappa - se * z_crit
upper = kappa + se * z_crit
print(
    f'kappa = {kappa}\n',
    f'a = {a:.3f}, b = {b:.3f}, c = {c:.3f}\n',
    f'standard error = {se:.3f}\n',
    f'lower confidence interval = {lower:.3f}\n',
    f'upper confidence interval = {upper:.3f}',
    sep=''
)
## kappa = 0.4
## a = 0.110, b = 0.116, c = 0.010
## standard error = 0.131
## lower confidence interval = 0.142
## upper confidence interval = 0.658
And using it on the data from the Real Statistics example gives:
## kappa = 0.496
## a = 0.280, b = 0.045, c = 0.097
## standard error = 0.106
## lower confidence interval = 0.28767
## upper confidence interval = 0.70414
Now the confidence interval (0.28767 to 0.70414) exactly matches the original one.