This page replicates the example given on the Wikipedia page about the odds ratio:
Suppose a radiation leak in a village of 1,000 people increased the incidence of a rare disease. The total number of people exposed to the radiation was 400, out of which 20 developed the disease and 380 stayed healthy. The total number of people not exposed was 600, out of which 6 developed the disease and 594 stayed healthy.
We can tabulate the data from the above example as follows:
import pandas as pd
# Create a data frame from a dictionary
dct = {
    'diseased': [20, 6],
    'healthy': [380, 594],
}
df = pd.DataFrame(dct, index=['exposed', 'not_exposed'])
print(df)
##              diseased  healthy
## exposed            20      380
## not_exposed         6      594
It is more useful (and realistic) to start with this data in a ‘long format’, i.e. as a data frame with 1,000 rows where each row corresponds to one person:
# Create a data frame from a dictionary
dct = {
    'exposed': [True] * 400 + [False] * 600,
    'diseased': [True] * 20 + [False] * 380 + [True] * 6 + [False] * 594,
}
df = pd.DataFrame(dct)
print(df.head())
##    exposed  diseased
## 0     True      True
## 1     True      True
## 2     True      True
## 3     True      True
## 4     True      True
This ‘long format’ data can then be summarised in a pivot table that counts the number of people in each combination of exposure status and disease status:
pivot_table = pd.pivot_table(df, index='exposed', columns='diseased', aggfunc='size')
print(pivot_table)
## diseased  False  True
## exposed
## False       594      6
## True        380     20
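The same table of counts can also be produced with `pd.crosstab`, which cross-tabulates two columns directly. This is only a sketch of an alternative to the pivot table above, not a replacement for it; the name `counts` is introduced here for illustration.

```python
import pandas as pd

# Rebuild the long-format data: one row per villager
df = pd.DataFrame({
    'exposed': [True] * 400 + [False] * 600,
    'diseased': [True] * 20 + [False] * 380 + [True] * 6 + [False] * 594,
})

# Cross-tabulate exposure status (rows) against disease status (columns)
counts = pd.crosstab(df['exposed'], df['diseased'])
print(counts)

# The four cells should account for all 1,000 villagers
print(counts.to_numpy().sum())
```

As with the pivot table, the rows and columns come out in sorted order (`False` before `True`).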
Usually, a contingency table will have the positive (in a statistical sense) results first:
# Re-format the pivot table into a contingency table by reversing the order of the columns and the rows
contingency_table = pivot_table.iloc[::-1, ::-1]
print(contingency_table)
## diseased  True   False
## exposed
## True         20    380
## False         6    594
The risk of developing the disease given exposure, and of developing the disease given non-exposure, is equal to the number of people who became diseased (20 and 6) divided by the total number of people who were exposed or not exposed (400 and 600), respectively. In other words, we need to sum the values in the contingency table’s rows and divide the contingency table by those values:
# Get the total of each row
totals = contingency_table.sum(axis=1)
# Divide the values in the contingency table by the total of their row
risk = contingency_table.div(totals, axis=0)
print(risk)
## diseased  True   False
## exposed
## True       0.05   0.95
## False      0.01   0.99
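As a cross-check on the row-wise division above, `pd.crosstab` can return row proportions directly via its `normalize='index'` argument. The sketch below starts again from the long-format data; the name `risk_check` is an assumption made for this example.

```python
import pandas as pd

# Rebuild the long-format data: one row per villager
df = pd.DataFrame({
    'exposed': [True] * 400 + [False] * 600,
    'diseased': [True] * 20 + [False] * 380 + [True] * 6 + [False] * 594,
})

# normalize='index' divides each row of counts by that row's total
risk_check = pd.crosstab(df['exposed'], df['diseased'], normalize='index')
print(risk_check)

# Each row of proportions should sum to 1
print(risk_check.sum(axis=1))
```

Note that the rows and columns here are in sorted order (`False` first), so the table is the mirror image of the contingency table used above.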
The relative risk of developing the disease given exposure vs non-exposure is simply one risk value divided by the other: \(\dfrac{0.05}{0.01}\)
relative_risk = risk.loc[True, True] / risk.loc[False, True]
print(f'Relative risk: {relative_risk:1.0f}')
## Relative risk: 5
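The same calculation can be written without pandas, straight from the four numbers in the example. The helper `relative_risk` below is a hypothetical function written for this illustration, not part of any library.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# 20 of 400 exposed villagers and 6 of 600 non-exposed villagers became diseased
rr = relative_risk(20, 400, 6, 600)
print(f'Relative risk: {rr:.1f}')
```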
The odds of getting the disease if exposed is the ratio of the number of people who became diseased to the number who did not (i.e. 20 divided by 380), and similarly for those who were not exposed:
odds = contingency_table[True] / contingency_table[False]
print(odds)
## exposed
## True     0.052632
## False    0.010101
## dtype: float64
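Odds and risk are two views of the same counts: if the risk is p, the odds are p / (1 - p). The short sketch below checks this against the counts in the example; all variable names are chosen for illustration only.

```python
# Risks: diseased divided by everyone in the group
risk_exposed = 20 / 400
risk_unexposed = 6 / 600

# Odds derived from the risks: p / (1 - p)
odds_exposed = risk_exposed / (1 - risk_exposed)
odds_unexposed = risk_unexposed / (1 - risk_unexposed)

# These should match the odds computed from the counts (20/380 and 6/594)
print(f'{odds_exposed:.6f} vs {20 / 380:.6f}')
print(f'{odds_unexposed:.6f} vs {6 / 594:.6f}')
```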
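In summary, the risk of disease is 0.05 for exposed villagers and 0.01 for non-exposed villagers, while the corresponding odds are about 0.053 and 0.010.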
The odds ratio is, unsurprisingly, the ratio of the two odds:
odds_ratio = odds[True] / odds[False]
print(f'Odds ratio: {odds_ratio:3.1f}')
## Odds ratio: 5.2
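Equivalently, the odds ratio can be computed from the four cells of the contingency table as a cross-product: (a × d) / (b × c), where a and b are the diseased and healthy counts among the exposed, and c and d are the same counts among the non-exposed. The function `odds_ratio_from_counts` below is a hypothetical helper written for this sketch.

```python
def odds_ratio_from_counts(a, b, c, d):
    """Cross-product odds ratio for a 2x2 table laid out as [[a, b], [c, d]]."""
    return (a * d) / (b * c)


# Exposed: 20 diseased, 380 healthy; not exposed: 6 diseased, 594 healthy
village_or = odds_ratio_from_counts(20, 380, 6, 594)
print(f'Odds ratio: {village_or:.2f}')
```

Because the disease is rare in both groups, this value is close to the relative risk of 5.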
The Wikipedia page goes on to give a second example in which data from all 26 diseased villagers is included but data from only 26 of the healthy villagers is available, which is a more realistic scenario:
# Create a data frame from a dictionary
dct = {
    'exposed': [True] * 30 + [False] * 22,
    'diseased': [True] * 20 + [False] * 10 + [True] * 6 + [False] * 16,
}
df = pd.DataFrame(dct)
pivot_table = pd.pivot_table(df, index='exposed', columns='diseased', aggfunc='size')
# Re-format the pivot table into a contingency table by reversing the order of the columns and the rows
contingency_table = pivot_table.iloc[::-1, ::-1]
print(contingency_table)
## diseased  True   False
## exposed
## True         20     10
## False         6     16
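The relative risk cannot be calculated because the healthy villagers were only sampled, so the row totals no longer reflect how many people were exposed or not exposed in the whole population. The odds ratio, however, can still be obtained by following the same steps as above: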
odds = contingency_table[True] / contingency_table[False]
odds_ratio = odds[True] / odds[False]
print(f'Odds ratio: {odds_ratio:3.1f}')
## Odds ratio: 5.3
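To see why the relative risk is not usable here, the sketch below naively applies the risk formula to the sampled counts and compares the result with the value from the full village. It also compares the two odds ratios. The variable names are illustrative assumptions.

```python
# Naive "relative risk" from the sample: row totals are 30 and 22,
# which are not the true numbers of exposed and non-exposed villagers
naive_rr = (20 / 30) / (6 / 22)

# Relative risk from the full village, for comparison
true_rr = (20 / 400) / (6 / 600)

# Odds ratios from the sample and from the full village
sample_or = (20 * 16) / (10 * 6)
village_or = (20 * 594) / (380 * 6)

print(f'Naive relative risk from sample: {naive_rr:.2f}')
print(f'Relative risk from full village: {true_rr:.2f}')
print(f'Odds ratio from sample: {sample_or:.2f}')
print(f'Odds ratio from full village: {village_or:.2f}')
```

The naive relative risk is far from the village value, whereas the sample odds ratio stays close to the village odds ratio.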
Another example on the Wikipedia page talks about a sample of 100 men where 90 drank wine in the previous week and a sample of 80 women where 20 drank wine in the same period:
# Create a data frame from a dictionary
# Columns: men (True) and women (False); rows: drank wine (True) and did not (False)
dct = {
    True: [90, 10],
    False: [20, 60],
}
contingency_table = pd.DataFrame(dct, index=[True, False])
print(contingency_table)
##        True   False
## True      90     20
## False     10     60
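Because the odds ratio is symmetric in the two variables, dividing the first column by the second and then taking the ratio of the two rows gives the same value as comparing the men's odds of drinking wine with the women's. The odds ratio is thus: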
odds = contingency_table[True] / contingency_table[False]
odds_ratio = odds[True] / odds[False]
print(f'Odds ratio: {odds_ratio:2.0f}')
## Odds ratio: 27
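The `True`/`False` labels in this last table do not say which axis is sex and which is wine drinking. The sketch below repeats the calculation with descriptive labels, which are assumptions introduced here for readability, and computes the odds of drinking wine within each sex before taking their ratio.

```python
import pandas as pd

# One row per sex, one column per outcome
wine = pd.DataFrame(
    {'drank_wine': [90, 20], 'no_wine': [10, 60]},
    index=['men', 'women'],
)

# Odds of having drunk wine within each sex
odds = wine['drank_wine'] / wine['no_wine']

# Odds ratio of men vs women
odds_ratio = odds['men'] / odds['women']
print(f'Odds ratio: {odds_ratio:.0f}')
```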