Some events have only two possible outcomes: a coin toss, for example, can result in either a head or a tail. Others can be simplified into having only two possible outcomes: rolling a die has six possible outcomes, but we can simplify the result by saying that it is either a six or not a six. For these types of events - ones with binary outcomes - we can use the binomial test to establish, to within a certain level of significance, whether the observed results are consistent with a hypothesized probability of success.
The code on this page uses the numpy, scipy and statsmodels packages. These can be installed from the terminal with:
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install statsmodels
where python3.11 corresponds to the version of Python you have installed and are using.
Import these packages into Python with:
import numpy as np
from scipy import stats as st
from statsmodels.stats import proportion
This example comes from the Wikipedia page:
If we roll a die 235 times and 6 comes up 51 of those times, is the die fair?
We would expect a 6 to come up 1/6th of the time, i.e. \(235 / 6 \approx 39.17\) times in 235 rolls. So it certainly seems as if the die might be biased towards rolling a 6 more often than expected.
In more statistically-rigorous language, the number of successes we have observed is 51 out of a total of 235 trials where the hypothesized probability of success is 1/6. These numbers are usually known as k, n and p respectively:
#
# Set up the example
#
# The number of successes
k = 51
# The number of trials
n = 235
# The hypothesized probability of success
p = 1 / 6
Under the null hypothesis that the die is indeed fair, the number of 6s follows a binomial distribution with \(n = 235\) and \(p = 1/6\), and we can test whether the die in the example produces a greater than expected number of 6s:

\[H_0: \pi = \pi_0 \quad \text{versus} \quad H_1: \pi > \pi_0\]

with true proportion \(\pi\) and expected proportion \(\pi_0 = 1/6\).
Let’s do this first by simulating many (specifically, 10,000) batches of 235 rolls:
# Set the seed so that we get the same random numbers each time this code runs
np.random.seed(20230814)
# Count the number of times we get a more extreme (or equally extreme) number
# of 6s than the number we actually observed
more_extreme = 0
# Iterate through 10,000 simulations
for _ in range(10000):
    # Simulate rolling a die n times with the result either being a 6 (True)
    # or not (False) with probability p
    outcomes = np.random.choice([True, False], size=n, p=[p, 1 - p])
    # Count the number of 6s
    successes = np.sum(outcomes)
    # If the number of 6s is at least as extreme as the observed count, count it
    if successes >= k:
        more_extreme += 1
# Calculate the p-value
p_value = more_extreme / 10000
print(f'p = {p_value:.4f}')
## p = 0.0262
This is close to the exact value given on the Wikipedia page: 0.02654.
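As an aside, the same Monte Carlo estimate can be produced in a vectorised way by drawing the 10,000 success counts directly from the binomial distribution. This is a minimal sketch: it uses NumPy's separate default_rng generator (so that the global random state used elsewhere on this page is unaffected) and, being seed-dependent, its result will differ slightly from the loop above:
# Use a separate, seeded generator so the global random state is unaffected
rng = np.random.default_rng(20230814)
# Draw 10,000 success counts directly from a binomial distribution
simulated = rng.binomial(n, p, size=10000)
# The p-value estimate is the proportion of simulations at least as extreme
# as the observed count; it should again be close to the exact 0.02654
p_value = np.mean(simulated >= k)
print(f'p = {p_value:.4f}')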
SciPy has a binomtest() function which will calculate the p-value exactly:
result = st.binomtest(k, n, p, alternative='greater')
print(f'p = {result.pvalue:.5f}')
## p = 0.02654
This matches Wikipedia.
See the documentation for this function here.
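As a cross-check, this one-sided p-value is simply the upper tail of the binomial distribution, so it can also be calculated directly from the survival function of scipy.stats.binom (a short sketch; st.binom.sf(k - 1, n, p) gives \(P(X \geq k)\)):
# P(X >= k) under the null hypothesis: the upper tail of Binomial(n, p)
p_value = st.binom.sf(k - 1, n, p)
print(f'p = {p_value:.5f}')
## p = 0.02654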
A more robust test would be to see if the die is either more or less likely than expected to roll a 6:

\[H_0: \pi = \pi_0 \quad \text{versus} \quad H_1: \pi \neq \pi_0\]

with true proportion \(\pi\) and expected proportion \(\pi_0 = 1/6\). Calculating the p-value is not a simple case of doubling the value from the one-sided test because the binomial distribution is not symmetric (unless \(p = 0.5\)). We have to consider both tails individually: the observed count of 51 is \(51 - 39.17 = 11.83\) above the expected count of 39.17, so a count of \(39.17 - 11.83 \approx 27.33\) or fewer is at least as extreme in the opposite direction.
Again, let’s simulate 10,000 batches of 235 die rolls:
# Count the number of times we get a more extreme (or equally extreme) number
# of 6s than the number we actually observed
more_extreme = 0
# k is the upper-tail, so calculate the lower-tail
expected = n * p
lower_tail = expected - (k - expected)
# Iterate through 10,000 simulations
for _ in range(10000):
    # Simulate rolling a die n times with the result either being a 6 (1)
    # or not (0) with probability p
    outcomes = np.random.choice([1, 0], size=n, p=[p, 1 - p])
    # Count the number of 6s
    successes = np.sum(outcomes)
    # If the number of 6s is at least as extreme as the observed count in the
    # upper tail, count it
    if successes >= k:
        more_extreme += 1
    # If the number of 6s is at least as extreme in the lower tail, count it
    if successes <= lower_tail:
        more_extreme += 1
# Calculate the p-value
p_value = more_extreme / 10000
print(f'p = {p_value:.4f}')
## p = 0.0421
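For reference, the exact value of the quantity this simulation estimates can be calculated from the binomial distribution's tail probabilities. This is a sketch; note that this "equal distance from the expected count" definition of extremeness is not the one SciPy's binomtest() uses by default, so this exact value need not agree precisely with the one calculated below:
# Exact version of what the simulation estimates: the upper tail at or above
# k plus the lower tail at or below the equally-extreme lower count
p_exact = st.binom.sf(k - 1, n, p) + st.binom.cdf(int(lower_tail), n, p)
print(f'p = {p_exact:.4f}')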
The simulated value is again close to Wikipedia's answer of 0.0437, which SciPy's binomtest() (two-sided by default) calculates exactly:
result = st.binomtest(k, n, p)
print(f'p = {result.pvalue:.4f}')
## p = 0.0437
This matches Wikipedia. Using a significance level of \(\alpha = 0.05\) (equivalently, a false positive rate of 5%: the probability of incorrectly finding a significant difference when there really is none) allows us to reject the null hypothesis and conclude that the die is not fair.
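As far as I can tell, SciPy's default two-sided method sums the probabilities of every outcome that is no more likely than the one observed (the "small p-values" approach). Here is a minimal sketch of that calculation, which should reproduce the value above (SciPy additionally applies a tiny floating-point tolerance when comparing probabilities, so edge cases may differ):
# Probability of each possible number of 6s under the null hypothesis
probs = st.binom.pmf(np.arange(n + 1), n, p)
# Sum the probabilities of all outcomes no more likely than the observed one
p_value = np.sum(probs[probs <= st.binom.pmf(k, n, p)])
print(f'p = {p_value:.4f}')
## p = 0.0437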
Both the \(\chi^2\) goodness-of-fit test and the one-sample Z-test for a proportion can be used to answer this example question:
# Perform the chi-squared goodness-of-fit test
observed = [51, 235 - 51]
expected = [235 / 6, 5 * 235 / 6]
chisq, p = st.chisquare(observed, expected)
print(f'χ² = {chisq:.2f}, p = {p:.3f}')
## χ² = 4.29, p = 0.038
# Perform a one-sample Z-test for a proportion
count = 51
nobs = 235
pi_0 = 1 / 6
z, p = proportion.proportions_ztest(count, nobs, value=pi_0, prop_var=pi_0)
print(f'Z = {z:.2f}, p = {p:.3f}')
## Z = 2.07, p = 0.038
Note that both tests give the same p-value and that the \(\chi^2\) statistic is the square of the Z statistic in binomial examples:
print(np.isclose(z**2, chisq))
## True
The reason that these tests give the same result is because the \(\chi^2\) goodness-of-fit test is derived from the Z-statistic.1 For more on \(\chi^2\) tests see the page here and for more on the one-sample Z-test for a proportion see the page here.
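To make the connection concrete, here is the Z-statistic calculated by hand from the standard one-sample proportion formula \(Z = (\hat{p} - \pi_0) / \sqrt{\pi_0 (1 - \pi_0) / n}\), using the example's numbers (a sketch; squaring it recovers the \(\chi^2\) statistic):
# Observed proportion of 6s
p_hat = k / n
# Standard error of the proportion under the null hypothesis
se = np.sqrt(pi_0 * (1 - pi_0) / n)
# The Z-statistic and its square
z_manual = (p_hat - pi_0) / se
print(f'Z = {z_manual:.2f}, Z² = {z_manual**2:.2f}')
## Z = 2.07, Z² = 4.29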
The binomial test is an exact test and so, in general, it should be used over the \(\chi^2\) test or the Z-test which are approximations.2 However, it’s worth noting that, due to the central limit theorem, as the sample size gets larger the binomial distribution starts getting more similar to the normal distribution and, hence, the results of the three tests converge3:
# Perform a binomial test with a very large sample size
k = 4000
n = 23500
p = 1 / 6
result = st.binomtest(k, n, p)
print(f'p = {result.pvalue:.3f}')
## p = 0.146
# Perform a one-sample Z-test for a proportion with a very large sample size
count = 4000
nobs = 23500
pi_0 = 1 / 6
z, p = proportion.proportions_ztest(count, nobs, value=pi_0, prop_var=pi_0)
print(f'p = {p:.3f}')
## p = 0.145
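For completeness, the \(\chi^2\) goodness-of-fit test on this large sample gives the same approximate p-value as the Z-test, as expected given that \(\chi^2 = Z^2\) (a sketch following the same pattern as before):
# Perform the chi-squared goodness-of-fit test with a very large sample size
observed = [4000, 23500 - 4000]
expected = [23500 / 6, 5 * 23500 / 6]
chisq, p = st.chisquare(observed, expected)
print(f'p = {p:.3f}')
## p = 0.145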
These values of 0.146 and 0.145 are very close! So, if our choice of test makes almost no difference when \(n\) is large, is there ever an instance when we would prefer the \(\chi^2\) or Z-test over the binomial? Here are some potential reasons4, 5:
So when can we use a Z-test instead of a binomial test? A rule of thumb is that \(p \times n\) and \((1 - p) \times n\) must both be greater than 5, where \(p\) is the hypothesized probability of success and \(n\) the sample size.
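Applying this rule of thumb to the original example confirms that the approximate tests were reasonable to use here (a minimal sketch; both products comfortably exceed 5):
# Check the rule of thumb for the original example of 235 die rolls
n, p = 235, 1 / 6
print(p * n > 5, (1 - p) * n > 5)
## True True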
1. Wallis, S. “z-squared: The Origin and Application of χ²”. Journal of Quantitative Linguistics 2013; 20:350-378. DOI: 10.1080/09296174.2013.830554. Available here.
2. GraphPad Software, LLC. “The binomial test”. Statistics Guide 2023. Available here.
3. Soetewey, A. “One-proportion and chi-square goodness of fit test”. Stats and R 2020. Available here.
4. Van den Berg, RG. “Binomial Test – Simple Tutorial”. SPSS Tutorials 2023. Available here.
5. Hessing, T. “One and Two Sample Z Proportion Hypothesis Tests”. 6σ Study Guide. Available here.