Cohen’s kappa is a measure of how much two judges agree with each other when they are rating things qualitatively.
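For reference, if $p_o$ is the observed proportion of agreement and $p_e$ is the proportion of agreement expected by chance alone, the statistic is defined as:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

This is the formula used in all of the worked examples below.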
Cohen’s kappa can be quickly found using the scikit-learn Python package. This can be installed from the terminal with:
python3.11 -m pip install scikit-learn
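Once installed, you can check that it imports correctly. Note that, although the package is installed as “scikit-learn”, it is imported under the name sklearn; the following one-liner should print your installed version number without raising an error:

python3.11 -c "import sklearn; print(sklearn.__version__)"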
Here are two examples that demonstrate how to use the cohen_kappa_score() function from this package:
The simple example given on the Wikipedia page involves two people each reading 50 proposals and saying either “Yes” or “No” to each. The data looks as follows:
| | | Reader B | |
|---|---|---|---|
| | | Yes | No |
| Reader A | Yes | 20 | 5 |
| | No | 10 | 15 |
This type of table, where the agreement between two raters or tests is shown, is known as a confusion matrix.
Using this information, we can manually reverse-engineer the raw data: the response of each reader to each proposal must be as follows (where Y stands for “yes” and N stands for “no”):
readerA = [
'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
]
readerB = [
'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
'N', 'N', 'N', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'Y',
'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
]
OK, we haven’t fully reverse-engineered the raw data because we don’t know specifically which proposals each reader accepted and rejected, but we have determined a generalised version of their responses, which is all we need.
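In fact, because only the counts matter, the same lists could equally be built directly from the four cells of the confusion matrix (a compact alternative to writing the responses out in full):

# Build the raw data from the cell counts: 20 Y/Y, 5 Y/N, 10 N/Y and 15 N/N pairs
readerA = ['Y'] * 25 + ['N'] * 25
readerB = ['Y'] * 20 + ['N'] * 5 + ['Y'] * 10 + ['N'] * 15

These produce exactly the same lists as the ones written out above.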
If you want to calculate Cohen’s kappa directly from the confusion matrix (ie without reverse-engineering the raw data) then see the ‘deeper dive’ worked examples below.
The Python package to use is sklearn.metrics, so import that into your script:
import sklearn.metrics
The first function of use is confusion_matrix() which, as the name suggests, re-creates the confusion matrix that we reverse-engineered to get the raw data:
cm = sklearn.metrics.confusion_matrix(readerA, readerB)
print(cm)
## [[15 10]
## [ 5 20]]
This matrix has the same numbers as the one we were given but in a different configuration. Fortunately, we can manually re-order the labels to fix this, using the labels keyword argument:
cm = sklearn.metrics.confusion_matrix(readerA, readerB, labels=['Y', 'N'])
print(cm)
## [[20 5]
## [10 15]]
This is now the same as what we started with, which is confirmation that we correctly reverse-engineered the raw data.
Now the Cohen’s kappa can be calculated, using the cohen_kappa_score() function:
kappa = sklearn.metrics.cohen_kappa_score(readerA, readerB)
print(kappa)
## 0.4
This value of 0.4 is the same as the solution on the original page.
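As an aside, cohen_kappa_score() also accepts a labels keyword argument, although for the unweighted score the ordering of the labels makes no difference to the result:

kappa = sklearn.metrics.cohen_kappa_score(readerA, readerB, labels=['Y', 'N'])
print(kappa)
## 0.4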
Another worked example can be found on the Real Statistics site. This one is more complicated because the two judges are assigning observations into three groups as opposed to the previous example which only had two groups (“yes” and “no”). Specifically, this example involves two psychologists who are diagnosing patients as one of “psychotic”, “borderline” or “neither”:
| | | Psychologist 1 | | |
|---|---|---|---|---|
| | | Psychotic | Borderline | Neither |
| Psychologist 2 | Psychotic | 10 | 4 | 1 |
| | Borderline | 6 | 16 | 2 |
| | Neither | 0 | 3 | 8 |
Again, we can manually reverse-engineer the raw data by looking at the above confusion matrix. Create a list for each psychologist’s diagnoses and fill them in with 'P' for “psychotic”, 'B' for “borderline” and 'N' for “neither”:
psychologist1 = [
'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P',
'P', 'P', 'P', 'P', 'P', 'P', 'B', 'B', 'B', 'B',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
]
psychologist2 = [
'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
'B', 'B', 'P', 'P', 'P', 'P', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'P', 'B', 'B',
]
…and re-produce the confusion matrix:
cm = sklearn.metrics.confusion_matrix(psychologist2, psychologist1, labels=['P', 'B', 'N'])
print(cm)
## [[10 4 1]
## [ 6 16 2]
## [ 0 3 8]]
…and calculate the Cohen’s kappa:
kappa = sklearn.metrics.cohen_kappa_score(psychologist1, psychologist2)
print(kappa)
## 0.49590422180214233
This agrees with the value given on the original page.
If you want to calculate Cohen’s kappa directly from the confusion matrix (ie without reverse-engineering the raw data) then take a look at the ‘deeper dive’ worked examples which follow:
While the cohen_kappa_score() function produces the correct result, it hides how the calculation is performed. This is normally fine, but sometimes we need to use the intermediate values produced by the calculation steps, for example if we want to calculate confidence intervals. So it’s worthwhile being able to perform the calculation in ‘long-hand’.
Both the Wikipedia page and the scikit-learn source code detail how to calculate Cohen’s kappa using essentially the same method. We start with the confusion matrix as created in the previous example:
cm = sklearn.metrics.confusion_matrix(readerA, readerB, labels=['Y', 'N'])
print(cm)
## [[20 5]
## [10 15]]
We will need to use the NumPy package for this, so remember to import that:
import numpy as np
Now, we create the ‘expected matrix’ (the counts of each type of agreement and disagreement between these readers that you would expect to see due to chance alone):
# Sample size
n = np.sum(cm)
# Expected matrix
sum0 = np.sum(cm, axis=0)
sum1 = np.sum(cm, axis=1)
expected = np.outer(sum0, sum1) / n
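For this data the expected matrix looks as follows (these values follow directly from the row and column totals, so printing it is just a quick sanity check):

print(expected)
## [[15. 15.]
##  [10. 10.]]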
The Wikipedia page then carries on the calculation as follows:
# Number of classes
n_classes = cm.shape[0]
# Calculate p_o (the observed proportionate agreement) and
# p_e (the probability of random agreement)
identity = np.identity(n_classes)
p_o = np.sum((identity * cm) / n)
p_e = np.sum((identity * expected) / n)
# Calculate Cohen's kappa
kappa = (p_o - p_e) / (1 - p_e)
print(f'p_o = {p_o}, p_e = {p_e}, kappa = {kappa:3.1f}')
## p_o = 0.7, p_e = 0.5, kappa = 0.4
These results match those on the Wikipedia page itself.
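Incidentally, multiplying by the identity matrix and summing is just a way of adding up the diagonal, so np.trace() can be used as a more compact equivalent:

# Equivalent calculation using the trace (the sum of the diagonal)
p_o = np.trace(cm) / n
p_e = np.trace(expected) / n
print(f'p_o = {p_o}, p_e = {p_e}')
## p_o = 0.7, p_e = 0.5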
The method used by the cohen_kappa_score() function (as seen from its source code) looks a little different because it includes the option of using weights to alter how the score is calculated. Here’s how it looks when the weights are ‘none’ (ie when it’s used equivalently to the Wikipedia method):
# Number of classes
n_classes = cm.shape[0]
# The following assumes you are not using weights
w_mat = np.ones([n_classes, n_classes], dtype=int)
w_mat.flat[:: n_classes + 1] = 0
k = np.sum(w_mat * cm) / np.sum(w_mat * expected)
# Calculate Cohen's kappa
kappa = 1 - k
print(f'kappa = {kappa}')
## kappa = 0.4
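As an aside, when weights are used the only thing that changes is the weight matrix. The sketch below (loosely based on the scikit-learn source, not a verbatim copy of it) shows ‘linear’ weighting, where disagreements are penalised by the absolute difference between class indices; with only two classes this gives the same weight matrix as above, so the score is unchanged:

# Linear weights: the penalty equals the absolute difference between class indices
indices = np.arange(n_classes)
w_mat = np.abs(indices[:, None] - indices[None, :])
# For two classes this is [[0, 1], [1, 0]], identical to the unweighted case
kappa = 1 - np.sum(w_mat * cm) / np.sum(w_mat * expected)
print(kappa)
## 0.4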
Unsurprisingly, the method used on the Real Statistics page is the same as that on Wikipedia and in the scikit-learn source code. However, they go through the process a bit more slowly, so it’s worthwhile re-creating their steps here. First, re-create the confusion matrix:
cm = sklearn.metrics.confusion_matrix(psychologist2, psychologist1, labels=['P', 'B', 'N'])
print(cm)
## [[10 4 1]
## [ 6 16 2]
## [ 0 3 8]]
Calculate the agreement, the agreement by chance and the sample size from the confusion matrix:
# Sample size
n = np.sum(cm)
# Number of classes
n_classes = cm.shape[0]
# Agreement
agreement = 0
for i in np.arange(n_classes):
    # Sum the diagonal values
    agreement += cm[i, i]
# Agreement due to chance
judge1_totals = np.sum(cm, axis=0)
judge2_totals = np.sum(cm, axis=1)
judge1_totals_prop = np.sum(cm, axis=0) / n
judge2_totals_prop = np.sum(cm, axis=1) / n
by_chance = np.sum(judge1_totals_prop * judge2_totals_prop * n)
# Calculate Cohen's kappa
kappa = (agreement - by_chance) / (n - by_chance)
print(f'agreement = {agreement}, by_chance = {by_chance:.3f}, kappa = {kappa:.3f}')
## agreement = 34, by_chance = 18.260, kappa = 0.496
These results are the same as on the original page.
Adding a confidence interval to the Cohen’s kappa value is slightly more complicated because the scikit-learn code does not do this for you and the Wikipedia and Real Statistics methods differ slightly.
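For reference, the approximate standard error used in the Wikipedia approach is (with $N$ being the sample size and $p_o$, $p_e$ as defined above):

$$SE_\kappa = \sqrt{\frac{p_o (1 - p_o)}{N (1 - p_e)^2}}$$

and a 95% confidence interval is then $\kappa \pm 1.96 \times SE_\kappa$.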
This is explained on the Wikipedia page and can be coded up as follows:
# Standard error of Cohen's kappa
se = np.sqrt((p_o * (1 - p_o)) / (n * (1 - p_e)**2))
# Width of the 95% confidence interval
ci = 1.96 * se * 2
# Bounds of the 95% confidence interval
lower = kappa - 1.96 * se
upper = kappa + 1.96 * se
print(f'standard error = {se}')
print(f'lower confidence interval = {lower}')
print(f'upper confidence interval = {upper}')
Using this formula on the data from the Wikipedia example gives:
# Confusion matrix
cm = sklearn.metrics.confusion_matrix(readerA, readerB, labels=['Y', 'N'])
# Sample size
n = np.sum(cm)
# Expected matrix
sum0 = np.sum(cm, axis=0)
sum1 = np.sum(cm, axis=1)
expected = np.outer(sum0, sum1) / n
# Number of classes
n_classes = cm.shape[0]
# Calculate p_o (the observed proportionate agreement) and
# p_e (the probability of random agreement)
identity = np.identity(n_classes)
p_o = np.sum((identity * cm) / n)
p_e = np.sum((identity * expected) / n)
# Calculate Cohen's kappa
kappa = (p_o - p_e) / (1 - p_e)
# Confidence intervals
se = np.sqrt((p_o * (1 - p_o)) / (n * (1 - p_e)**2))
ci = 1.96 * se * 2
lower = kappa - 1.96 * se
upper = kappa + 1.96 * se
print(
    f'p_o = {p_o}, p_e = {p_e}, kappa = {kappa:.1f}\n',
    f'standard error = {se:.3f}\n',
    f'lower confidence interval = {lower:.3f}\n',
    f'upper confidence interval = {upper:.3f}', sep=''
)
## p_o = 0.7, p_e = 0.5, kappa = 0.4
## standard error = 0.130
## lower confidence interval = 0.146
## upper confidence interval = 0.654
And using this formula on the data from the Real Statistics example gives:
## p_o = 0.68, p_e = 0.365, kappa = 0.496
## standard error = 0.104
## lower confidence interval = 0.292
## upper confidence interval = 0.700
Note that this confidence interval of 0.292 to 0.700 is slightly different to the one given on the original page (which was 0.288 to 0.704). However, the fact that they agree to two decimal places suggests that the two methods are broadly equivalent.
The following alternative method comes from the Real Statistics page:
import scipy.stats

# This assumes that cm, n and kappa from the previous example are still defined
# Expected matrix
sum0 = np.sum(cm, axis=0)
sum1 = np.sum(cm, axis=1)
expected = np.outer(sum0, sum1) / n
# Number of classes
n_classes = cm.shape[0]
# Calculate p_o (the observed proportionate agreement) and
# p_e (the probability of random agreement)
identity = np.identity(n_classes)
p_o = np.sum((identity * cm) / n)
p_e = np.sum((identity * expected) / n)
# Calculate a
ones = np.ones([n_classes, n_classes])
row_sums = np.inner(cm, ones)
col_sums = np.inner(cm.T, ones).T
sums = row_sums + col_sums
a_mat = cm / n * (1 - sums / n * (1 - kappa))**2
a = np.sum(identity * a_mat)
# Calculate b
b_mat = cm / n * (sums / n)**2
b_mat = b_mat * (ones - identity)
b = (1 - kappa)**2 * np.sum(b_mat)
# Calculate c
c = (kappa - p_e * (1 - kappa))**2
# Standard error
se = np.sqrt((a + b - c) / n) / (1 - p_e)
# Two-tailed statistical test
alpha = 0.05
z_crit = scipy.stats.norm.ppf(1 - alpha / 2)
ci = se * z_crit * 2
lower = kappa - se * z_crit
upper = kappa + se * z_crit
print(f'a = {a}, b = {b}, c = {c}')
print(f'standard error = {se}')
print(f'lower confidence interval = {lower}')
print(f'upper confidence interval = {upper}')
Using this method on the data from the Wikipedia example gives:
import scipy.stats
# Confusion matrix
cm = sklearn.metrics.confusion_matrix(readerA, readerB, labels=['Y', 'N'])
#
# Cohen's kappa
#
# Sample size
n = np.sum(cm)
# Number of classes
n_classes = cm.shape[0]
# Agreement
agreement = 0
for i in np.arange(n_classes):
    # Sum the diagonal values
    agreement += cm[i, i]
# Agreement due to chance
judge1_totals = np.sum(cm, axis=0)
judge2_totals = np.sum(cm, axis=1)
judge1_totals_prop = np.sum(cm, axis=0) / n
judge2_totals_prop = np.sum(cm, axis=1) / n
by_chance = np.sum(judge1_totals_prop * judge2_totals_prop * n)
# Calculate Cohen's kappa
kappa = (agreement - by_chance) / (n - by_chance)
#
# Confidence interval
#
# Expected matrix
sum0 = np.sum(cm, axis=0)
sum1 = np.sum(cm, axis=1)
expected = np.outer(sum0, sum1) / n
# Number of classes
n_classes = cm.shape[0]
# Calculate p_o (the observed proportionate agreement) and
# p_e (the probability of random agreement)
identity = np.identity(n_classes)
p_o = np.sum((identity * cm) / n)
p_e = np.sum((identity * expected) / n)
# Calculate a
ones = np.ones([n_classes, n_classes])
row_sums = np.inner(cm, ones)
col_sums = np.inner(cm.T, ones).T
sums = row_sums + col_sums
a_mat = cm / n * (1 - sums / n * (1 - kappa))**2
identity = np.identity(n_classes)
a = np.sum(identity * a_mat)
# Calculate b
b_mat = cm / n * (sums / n)**2
b_mat = b_mat * (ones - identity)
b = (1 - kappa)**2 * np.sum(b_mat)
# Calculate c
c = (kappa - p_e * (1 - kappa))**2
# Standard error
se = np.sqrt((a + b - c) / n) / (1 - p_e)
# Two-tailed statistical test
alpha = 0.05
z_crit = scipy.stats.norm.ppf(1 - alpha / 2)
ci = se * z_crit * 2
lower = kappa - se * z_crit
upper = kappa + se * z_crit
print(
    f'kappa = {kappa}\n',
    f'a = {a:.3f}, b = {b:.3f}, c = {c:.3f}\n',
    f'standard error = {se:.3f}\n',
    f'lower confidence interval = {lower:.3f}\n',
    f'upper confidence interval = {upper:.3f}',
    sep=''
)
## kappa = 0.4
## a = 0.110, b = 0.116, c = 0.010
## standard error = 0.131
## lower confidence interval = 0.142
## upper confidence interval = 0.658
And using it on the data from the Real Statistics example gives:
## kappa = 0.496
## a = 0.280, b = 0.045, c = 0.097
## standard error = 0.106
## lower confidence interval = 0.28767
## upper confidence interval = 0.70414
Now the confidence interval (0.28767 to 0.70414) exactly matches the original one.