If you have:
…then the two-sample Z-test for a proportion might be for you. Consult the following flowchart for a more complete decision-making process:
This test, like all Z-tests, involves calculating a Z-statistic which, loosely, is the distance between what you have (the difference between the proportions actually seen in your samples) and what random chance would give you (the difference between the proportions that would be seen in samples that were perfectly randomly-generated). Alternatively, the Z-statistic can be thought of as a signal-to-noise ratio: a large value would indicate that the difference (between the two samples’ proportions) is large relative to random variation (a difference that could occur by chance).
Under the assumption that Z-statistics are normally-distributed, we can then calculate how unlikely it is that your samples have produced the Z-statistic that they have. More specifically, the Z-statistic is compared with its reference distribution (the standard normal distribution) to return a p-value.
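Loosely, for two samples of sizes \(n_0\) and \(n_1\) with observed proportions \(\hat{\pi}_0\) and \(\hat{\pi}_1\), one common form of this statistic (a sketch of the pooled version, which appears to be what the statsmodels function used below computes by default) is

\[Z = \frac{\hat{\pi}_1 - \hat{\pi}_0}{\sqrt{\hat{\pi}(1 - \hat{\pi})\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}}, \qquad \hat{\pi} = \frac{x_0 + x_1}{n_0 + n_1}\]

where \(x_0\) and \(x_1\) are the numbers of ‘successes’ in each sample and \(\hat{\pi}\) is the pooled proportion. The numerator is the ‘signal’ (the observed difference between the proportions) and the denominator is the ‘noise’ (the standard error of that difference under the null hypothesis).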
The code on this page uses the numpy, scipy and statsmodels packages. These can be installed from the terminal with:
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install statsmodels
where python3.11 corresponds to the version of Python you have installed and are using.
As an example we will use the “Spector and Mazzeo (1980) - Program Effectiveness Data” which is included in the statsmodels package as spector (see the statsmodels documentation for more details). This dataset records the test results of students:
import statsmodels.api as sm
# Load the data set (a dict-like Dataset object containing dataframes)
data = sm.datasets.spector.load_pandas()
# Extract the complete dataset
df = data['data']
# Rename
df = df.rename(columns={
'TUCE': 'test_score',
'PSI': 'participated',
'GRADE': 'grade_improved'
})
print(df[15:22])
## GPA test_score participated grade_improved
## 15 2.74 19.0 0.0 0.0
## 16 2.75 25.0 0.0 0.0
## 17 2.83 19.0 0.0 0.0
## 18 3.12 23.0 1.0 0.0
## 19 3.16 25.0 1.0 1.0
## 20 2.06 22.0 1.0 0.0
## 21 3.62 28.0 1.0 1.0
One of the columns in our dataset is grade_improved which records whether or not the student’s test scores improved (1 for yes, 0 for no). Another column is participated which indicates if that student participated in a programme where they received more personalised teaching (again, 1 for yes, 0 for no). We can hypothesise that a different proportion of students who got the personalised teaching saw their scores improve compared to those who didn’t:
\[H_0: \pi_0 = \pi_1 \quad \text{versus} \quad H_1: \pi_0 \neq \pi_1\]
or, equivalently,
\[H_0: \pi_1 - \pi_0 = 0 \quad \text{versus} \quad H_1: \pi_1 - \pi_0 \neq 0\]
where \(\pi\) is the proportion of students whose grades improved and where group 0 is those who did not participate in the programme while group 1 is those who did. Stated in words, the null hypothesis is that the proportion of students whose grades improved is the same in both groups, and the alternative hypothesis is that the two proportions differ.
We test \(H_0\) against \(H_1\) using a two-sample Z-test for a proportion:
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
# Subset out the two groups
group0 = df[df['participated'] == 0]
group1 = df[df['participated'] == 1]
# Number of successes
count = np.array([
len(group0[group0['grade_improved'] == 1]),
len(group1[group1['grade_improved'] == 1])
])
# Number of observations
nobs = np.array([len(group0), len(group1)])
# Perform a two-sample Z-test for a proportion
z_stat, p_value = proportions_ztest(count, nobs)
print(f'Z-statistic = {z_stat:.3f}; p = {p_value:.3f}')
## Z-statistic = -2.391; p = 0.017
# Proportion successful
pi = count / nobs
print(
f'Proportion of students who improved after not participating in the programme: {pi[0]:.1%}',
f'\nProportion of students who improved after participating in the programme: {pi[1]:.1%}'
)
## Proportion of students who improved after not participating in the programme: 16.7%
## Proportion of students who improved after participating in the programme: 57.1%
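As a sanity check (a sketch reusing count, nobs and pi from the code above), the Z-statistic can be recomputed by hand with the pooled formula given earlier; note that the difference is taken in the order the groups appear in count, i.e. group 0 minus group 1, hence the negative sign:
# Recompute the Z-statistic by hand using a pooled standard error
pi_pooled = np.sum(count) / np.sum(nobs)
se_pooled = np.sqrt(pi_pooled * (1 - pi_pooled) * np.sum(1 / nobs))
z_manual = (pi[0] - pi[1]) / se_pooled
print(f'Z-statistic = {z_manual:.3f}')
## Z-statistic = -2.391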
An important insight to note is that this p-value is the same as that of Pearson’s chi-squared (pronounced “kai-squared”) independence test:
from scipy import stats
# Are two distributions independent?
count1 = np.array([
len(group0[group0['grade_improved'] == 1]),
len(group1[group1['grade_improved'] == 1])
])
count0 = np.array([
len(group0[group0['grade_improved'] == 0]),
len(group1[group1['grade_improved'] == 0])
])
observations = np.array(
[
count1,
count0
]
)
row_totals = np.array([np.sum(observations, axis=1)])
col_totals = np.array([np.sum(observations, axis=0)])
n = np.sum(observations)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
# Calculate the chi-squared test statistic
# (stats.chisquare works column-by-column, so sum the per-column statistics)
chisq, p = stats.chisquare(observations, expected)
chisq = np.sum(chisq)
# Degrees of freedom
# (named 'dof' so as not to overwrite the dataframe 'df' created earlier)
rows = observations.shape[0]
cols = observations.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert the chi-squared test statistic to a p-value
p = 1 - stats.chi2.cdf(chisq, dof)
print(f'p = {p}')
## p = 0.01678005766053403
This isn’t a fluke: both tests have the same hypotheses so we would expect the same results!
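As a cross-check (a sketch reusing the observations array and z_stat from above, and assuming scipy.stats.chi2_contingency is available), the same result can be obtained directly; with no continuity correction the chi-squared statistic is simply the square of the Z-statistic:
# The same chi-squared test via scipy's built-in function
# (correction=False, i.e. no Yates continuity correction, to match the Z-test)
chi2, p, dof, expected_freq = stats.chi2_contingency(observations, correction=False)
print(f'χ² = {chi2:.3f} (Z² = {z_stat**2:.3f}); p = {p:.3f}')
## χ² = 5.719 (Z² = 5.719); p = 0.017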
To quantify the uncertainty in the difference between the two proportions, use the binomial proportion confidence interval.
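As a sketch of what the code below computes (the usual normal-approximation interval for a difference of two proportions, using unpooled standard errors), the interval is

\[(\hat{\pi}_1 - \hat{\pi}_0) \pm z_{1-\alpha/2}\sqrt{\frac{\hat{\pi}_0(1-\hat{\pi}_0)}{n_0} + \frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1}}\]

where \(z_{1-\alpha/2}\) is the critical value of the standard normal distribution (approximately 1.96 for a 95% interval):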
# Difference of the proportions
diff_pi = pi[1] - pi[0]
# Standard errors of the proportions
se = np.sqrt((pi * (1 - pi)) / nobs)
# Standard error of the difference of the proportions
se_diff_pi = np.sqrt(np.sum(se**2))
# Significance level
alpha = 0.05
# Percent-point function (aka quantile function) of the normal distribution
z_critical = stats.norm.ppf(1 - (alpha / 2))
# Margin of error
d = z_critical * se_diff_pi
# Confidence interval
ci_lower = diff_pi - d
ci_upper = diff_pi + d
print(
f'π₁ - π₀ = {diff_pi:.1%}, 95% CI [{ci_lower:.1%}, {ci_upper:.1%}]',
f'\nThe proportion of group 1 that improved is {abs(diff_pi):.1%} ± {d:.1%} greater than group 0.'
)
## π₁ - π₀ = 40.5%, 95% CI [9.4%, 71.6%]
## The proportion of group 1 that improved is 40.5% ± 31.1% greater than group 0.
Instead of a Z-test we could consider using an exact test:
# Expected counts
# (the frequencies/proportions we expect to see under the null hypothesis)
pi_overall = np.sum(count) / np.sum(nobs)
expected_counts = pi_overall * nobs
print(expected_counts)
## [6.1875 4.8125]
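For context (a sketch reusing pi_overall and nobs from above), the expected counts for the full 2×2 table can be computed in the same way; the smallest of them is below 5, which is the usual rule of thumb for preferring an exact test:
# Expected counts for both outcomes (improved and not improved) in both groups
expected_table = np.outer([pi_overall, 1 - pi_overall], nobs)
print(expected_table)
## [[ 6.1875  4.8125]
##  [11.8125  9.1875]]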
# The most common exact test for comparing two proportions is Fisher's exact
# test, recommended when an expected count is lower than 5 (as is the case here)
table = [
[count[0], nobs[0] - count[0]],
[count[1], nobs[1] - count[1]]
]
odds_ratio, p_value = stats.fisher_exact(table)
print(f"Fisher's exact test: p = {p_value:.3f}")
## Fisher's exact test: p = 0.027
Remember that a difference must be statistically significant AND sufficiently large to be of practical importance. For example, the Higher Education Statistics Agency (HESA) requires differences between groups under investigation to be: