
If you have:

  • Two independent groups of observations, and
  • A binary (yes/no) outcome recorded for each observation, and
  • A question about whether the proportion of ‘yes’ outcomes differs between the two groups,

…then the two-sample Z-test for a proportion might be for you.

This test, like all Z-tests, involves calculating a Z-statistic which, loosely, is the distance between what you have (the difference between the proportions actually seen in your two samples) and what you would expect under the null hypothesis (no difference between the underlying proportions), measured in units of standard error. Alternatively, the Z-statistic can be thought of as a signal-to-noise ratio: a large value indicates that the difference between the two samples’ proportions is large relative to random variation (the size of difference that could plausibly occur by chance).
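Concretely, using the pooled estimate of the common proportion (which is what the statsmodels function used below calculates by default), the statistic is:

\[
Z = \frac{\hat{\pi}_0 - \hat{\pi}_1}{\sqrt{\hat{\pi}(1 - \hat{\pi})\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}}
\qquad \text{where} \qquad
\hat{\pi} = \frac{x_0 + x_1}{n_0 + n_1}
\]

where \(\hat{\pi}_0\) and \(\hat{\pi}_1\) are the two sample proportions, \(x_0\) and \(x_1\) are the numbers of ‘successes’ and \(n_0\) and \(n_1\) are the sample sizes.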

Under the null hypothesis the Z-statistic (approximately) follows the standard normal distribution, so we can calculate how unlikely it is that your samples would produce a Z-statistic at least as extreme as the one they did. More specifically, the Z-statistic is compared with its reference distribution (the standard normal distribution) to return a p-value.
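For the two-sided hypotheses used on this page, this comparison amounts to:

\[
p = 2\left(1 - \Phi(|Z|)\right)
\]

where \(\Phi\) is the cumulative distribution function of the standard normal distribution.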

The code on this page uses the numpy, scipy and statsmodels packages. These can be installed from the terminal with:

$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install statsmodels

where python3.11 corresponds to the version of Python you have installed and are using.
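If you want to confirm which versions you have installed, a quick check is:

import numpy as np
import scipy
import statsmodels

# Print the version of each package
print(np.__version__, scipy.__version__, statsmodels.__version__)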

1 Example Data

As an example we will use the “Spector and Mazzeo (1980) - Program Effectiveness Data” which is included in the statsmodels package as spector (see the statsmodels documentation for more details). This dataset records the test results of 32 students:

import statsmodels.api as sm

# Load the dataset (an object that contains the data as dataframes)
data = sm.datasets.spector.load_pandas()
# Extract the complete dataset
df = data['data']
# Rename
df = df.rename(columns={
    'TUCE': 'test_score',
    'PSI': 'participated',
    'GRADE': 'grade_improved'
})

print(df[15:22])
##      GPA  test_score  participated  grade_improved
## 15  2.74        19.0           0.0             0.0
## 16  2.75        25.0           0.0             0.0
## 17  2.83        19.0           0.0             0.0
## 18  3.12        23.0           1.0             0.0
## 19  3.16        25.0           1.0             1.0
## 20  2.06        22.0           1.0             0.0
## 21  3.62        28.0           1.0             1.0
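Before setting up the test it can help to see the two columns of interest as a 2×2 contingency table. A quick cross-tabulation (pandas is assumed to be available, since statsmodels depends on it) gives the counts behind the proportions used later on this page:

import pandas as pd

# Cross-tabulate programme participation against grade improvement
print(pd.crosstab(df['participated'], df['grade_improved']))
## grade_improved  0.0  1.0
## participated
## 0.0              15    3
## 1.0               6    8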

2 Hypotheses

One of the columns in our dataset is grade_improved which records whether or not the student’s test scores improved (1 for yes, 0 for no). Another column is participated which indicates if that student participated in a programme where they received more personalised teaching (again, 1 for yes, 0 for no). We can hypothesise that the proportion of students whose scores improved differs between those who received the personalised teaching and those who did not:

  • \(H_0: \pi_{group \, 0} = \pi_{group \, 1}\)
  • \(H_1: \pi_{group \, 0} \neq \pi_{group \, 1}\)

or

  • \(H_0: \pi_{group \, 0} - \pi_{group \, 1} = 0\)
  • \(H_1: \pi_{group \, 0} - \pi_{group \, 1} \neq 0\)

where \(\pi\) is the proportion of students whose grades improved and where group 0 is those who did not participate in the programme while group 1 is those who did. Stated in words the hypotheses are:

  • The true proportion of students improving their test score is unrelated to/does not depend on whether or not they participated in the personalized system of instruction programme, against:
  • The true proportion of students improving their test score is related to/depends on whether or not they participated in the personalized system of instruction programme.

3 Test the Hypotheses

We test \(H_0\) against \(H_1\) using a two-sample Z-test for a proportion:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Subset out the two groups
group0 = df[df['participated'] == 0]
group1 = df[df['participated'] == 1]

# Number of successes
count = np.array([
    len(group0[group0['grade_improved'] == 1]),
    len(group1[group1['grade_improved'] == 1])
])
# Number of observations
nobs = np.array([len(group0), len(group1)])
# Perform a two-sample Z-test for a proportion
z_stat, p_value = proportions_ztest(count, nobs)

print(f'Z-statistic = {z_stat:.3f}; p = {p_value:.3f}')
## Z-statistic = -2.391; p = 0.017
# Proportion successful
pi = count / nobs

print(
    f'Proportion of students who improved after not participating in the programme: {pi[0]:.1%}',
    f'\nProportion of students who improved after participating in the programme: {pi[1]:.1%}'
)
## Proportion of students who improved after not participating in the programme: 16.7% 
## Proportion of students who improved after participating in the programme: 57.1%
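As a sanity check, here is a minimal sketch of the same calculation done by hand, assuming the pooled form of the test statistic described at the top of this page:

from scipy import stats

# Pooled estimate of the common proportion under the null hypothesis
pi_pooled = np.sum(count) / np.sum(nobs)
# Standard error of the difference between the proportions under the null hypothesis
se_pooled = np.sqrt(pi_pooled * (1 - pi_pooled) * np.sum(1 / nobs))
# Z-statistic (group 0 minus group 1, matching proportions_ztest's ordering)
z_manual = (pi[0] - pi[1]) / se_pooled
# Two-sided p-value from the standard normal distribution
p_manual = 2 * (1 - stats.norm.cdf(abs(z_manual)))

print(f'Z-statistic = {z_manual:.3f}; p = {p_manual:.3f}')
## Z-statistic = -2.391; p = 0.017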

4 Comparison with Pearson’s Chi-Squared Independence Test

An important insight to note is that this p-value is the same as that of Pearson’s chi-squared (pronounced “kai-squared”) independence test:

from scipy import stats

# Are two distributions independent?
count1 = np.array([
    len(group0[group0['grade_improved'] == 1]),
    len(group1[group1['grade_improved'] == 1])
])
count0 = np.array([
    len(group0[group0['grade_improved'] == 0]),
    len(group1[group1['grade_improved'] == 0])
])
observations = np.array(
    [
        count1,
        count0
    ]
)
row_totals = np.array([np.sum(observations, axis=1)])
col_totals = np.array([np.sum(observations, axis=0)])
n = np.sum(observations)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
# Calculate the chi-square test statistic
chisq, p = stats.chisquare(observations, expected)
chisq = np.sum(chisq)
# Degrees of freedom
rows = observations.shape[0]
cols = observations.shape[1]
dof = (rows - 1) * (cols - 1)

# Convert chi-square test statistic to p-value
p = 1 - stats.chi2.cdf(chisq, dof)

print(f'p = {p}')
## p = 0.01678005766053403

This isn’t a fluke: both tests have the same hypotheses, and for a 2×2 table the chi-squared statistic is simply the square of the Z-statistic ((−2.391)² ≈ 5.72), so we would expect the same results!
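If you would rather not construct the expected counts yourself, SciPy’s chi2_contingency() function performs the same test directly (the Yates continuity correction is turned off here so that it matches the uncorrected calculation above):

# Pearson's chi-squared test of independence, without Yates's continuity correction
chi2, p, dof, expected = stats.chi2_contingency(observations, correction=False)

print(f'p = {p:.5f}')
## p = 0.01678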

5 Confidence Interval

Use the (Wald) confidence interval for the difference between two binomial proportions:

# Difference of the proportions
diff_pi = pi[1] - pi[0]
# Standard errors of the proportions
se = np.sqrt((pi * (1 - pi)) / nobs)
# Standard error of the difference of the proportions
se_diff_pi = np.sqrt(np.sum(se**2))
# Significance level
alpha = 0.05
# Percent-point function (aka quantile function) of the normal distribution
z_critical = stats.norm.ppf(1 - (alpha / 2))
# Margin of error
d = z_critical * se_diff_pi
# Confidence interval
ci_lower = diff_pi - d
ci_upper = diff_pi + d

print(
    f'π₁ - π₀ = {diff_pi:.1%}, 95% CI [{ci_lower:.1%}, {ci_upper:.1%}]',
    f'\nThe proportion of group 1 that improved is {abs(diff_pi):.1%} ± {d:.1%} greater than group 0.'
)
## π₁ - π₀ = 40.5%, 95% CI [9.4%, 71.6%]
## The proportion of group 1 that improved is 40.5% ± 31.1% greater than group 0.
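For reference, recent versions of statsmodels (0.12 or newer) provide confint_proportions_2indep() which can compute an interval for the difference between two independent proportions directly; the sketch below requests the Wald method so that it should closely match the manual calculation above:

from statsmodels.stats.proportion import confint_proportions_2indep

# Wald confidence interval for the difference between the two proportions
# (group 1 minus group 0, to match the ordering used above)
ci_low, ci_upp = confint_proportions_2indep(
    count[1], nobs[1], count[0], nobs[0], method='wald', compare='diff'
)

print(f'95% CI [{ci_low:.1%}, {ci_upp:.1%}]')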

6 Interpretation

  • We reject the null hypothesis in favour of the alternative because the observed data provide evidence against the null hypothesis at the 5% significance level (p = 0.017 < 0.05).
  • We conclude that the true proportion of students improving their test score is not the same for both groups. In other words, the proportion of students improving their mark is associated with programme participation.
  • When generalising to the target population, we are 95% confident that the interval between 9.4% and 71.6% contains the true difference between the underlying proportions of improvement between the groups.
  • We are 95% confident that our sample difference of 40.5% is within 31.1 pp (percentage points) of the true difference between underlying proportions.

7 Exact Test

Instead of a Z-test we could consider using an exact test:

# Expected counts
# (the frequencies/proportions we expect to see under the null hypothesis)
pi_overall = np.sum(count) / np.sum(nobs)
expected_counts = pi_overall * nobs

print(expected_counts)
## [6.1875 4.8125]
# The most common exact test for comparing two proportions is a Fisher's exact
# test (when there are expected counts lower than 5)
table = [
    [count[0], nobs[0] - count[0]],
    [count[1], nobs[1] - count[1]]
]
odds_ratio, p_value = stats.fisher_exact(table)

print(f"Fisher's exact test: p = {p_value:.3f}")
## Fisher's exact test: p = 0.027
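If your version of SciPy is recent enough (1.7 or newer) then Barnard’s exact test is another option; it is generally more powerful than Fisher’s exact test for 2×2 tables. A minimal sketch:

# Barnard's exact test: an unconditional exact test for 2x2 contingency tables
res = stats.barnard_exact(table, alternative='two-sided')

print(f"Barnard's exact test: p = {res.pvalue:.3f}")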

Remember that a difference must be statistically significant AND sufficiently large to be of practical importance. For example, the Higher Education Statistics Agency (HESA) requires differences between groups under investigation to be:

  • Statistically significant at the 0.3% level, and
  • Larger than 3%
