
If you have:

  • one sample of binary (success/failure) data, and
  • a hypothesized value that you want to compare the true proportion of successes against,

…then the one-sample Z-test for a proportion might be for you.

This test, like all Z-tests, involves calculating a Z-statistic which, loosely, is the distance between the proportion actually seen in your sample and the proportion you would expect to see under the null hypothesis. Alternatively, the Z-statistic can be thought of as a signal-to-noise ratio: a large value indicates that the difference between the observed and expected proportions is large relative to random variation (ie relative to a difference that could plausibly occur by chance).

Under the null hypothesis the Z-statistic is (approximately) normally distributed, so we can calculate how unlikely it is that your sample would produce a Z-statistic at least as extreme as the one it did. More specifically, the Z-statistic is compared with its reference distribution (the standard normal distribution) to return a p-value.
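
Concretely, if \(\hat{\pi}\) is the observed sample proportion, \(\pi_0\) is the proportion expected under the null hypothesis and \(n\) is the sample size, the Z-statistic calculated on this page is:

\[Z = \frac{\hat{\pi} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}\]

where the denominator is the standard error of the proportion under the null hypothesis (this matches the ‘manual’ calculation in section 3.2 below).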

The code on this page uses the numpy, scipy and statsmodels packages. These can be installed from the terminal with:

$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install statsmodels

where python3.11 corresponds to the version of Python you have installed and are using.

1 Example Data

As an example we will use the “Spector and Mazzeo (1980) - Program Effectiveness Data” which is included in the statsmodels package as spector (see the statsmodels documentation for more). This dataset records the test results of students:

import statsmodels.api as sm

# Load the dataset (a statsmodels Dataset object)
data = sm.datasets.spector.load_pandas()
# Extract the complete dataset as a dataframe
df = data['data']
# Rename the columns to be more descriptive
df = df.rename(columns={
    'TUCE': 'test_score',
    'PSI': 'participated',
    'GRADE': 'grade_improved'
})

print(df[15:22])
##      GPA  test_score  participated  grade_improved
## 15  2.74        19.0           0.0             0.0
## 16  2.75        25.0           0.0             0.0
## 17  2.83        19.0           0.0             0.0
## 18  3.12        23.0           1.0             0.0
## 19  3.16        25.0           1.0             1.0
## 20  2.06        22.0           1.0             0.0
## 21  3.62        28.0           1.0             1.0
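
For reference, we can tally how many of the 32 students improved; this count of ‘successes’ is what the test below will use:

# Count the students whose grade improved
count_improved = (df['grade_improved'] == 1).sum()
print(f'{count_improved} of {len(df)} students improved their grade')
## 11 of 32 students improved their grade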

2 Hypotheses

One of the columns in our dataset is grade_improved which records whether or not the student’s test scores improved (1 for yes, 0 for no). Now, if it were the case that 50% of students’ scores improved (and 50% did not) then that would be interesting: it would suggest that it might be completely random as to whether a student improves or not. This might tell us something about the students, or the test, or the teaching method, or maybe not. In any case, it’s worth taking a look.

When working with categorical data like we have here (specifically, we have binary data) we talk about proportion, \(\pi\), instead of mean, \(\mu\). We are interested in whether or not 50% of students improved their test score - ie if the true proportion of students who improved is 0.5 - hence our null hypothesis is that \(\pi = 0.5\). Conversely, our alternative hypothesis is that the true proportion is not 50%; \(\pi \neq 0.5\):

  • \(H_0: \pi = 0.5\)
  • \(H_1: \pi \neq 0.5\)

3 Test the Hypotheses

If the sample size is large (\(n > 30\)) or the population variance is known, we can use a Z-test as opposed to a t-test for our hypothesis testing. In our example, we don’t know the population variance (we haven’t measured the whole population of students) but our sample size is above 30 (only just, at 32, but that’s good enough for an example). So we’ll use the one-sample Z-test for a proportion.

3.1 Using statsmodels

This test is available in statsmodels as proportions_ztest; see the statsmodels documentation for details.

from statsmodels.stats.proportion import proportions_ztest

# Number of successes
count = len(df[df['grade_improved'] == 1])
# Number of observations
nobs = len(df)
# Proportion under the null hypothesis
pi_0 = 0.5
# Perform a one-sample Z-test for a proportion
z_stat, p_value = proportions_ztest(count, nobs, value=pi_0, prop_var=pi_0)
# Proportion successful
pi = count / nobs

print(f'Proportion, π = {pi:.1%}; Z-statistic = {z_stat:.3f}; p = {p_value:.3f}')
## Proportion, π = 34.4%; Z-statistic = -1.768; p = 0.077
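
Note the prop_var=pi_0 argument: by default, proportions_ztest estimates the variance from the sample proportion, but passing the null-hypothesis proportion makes it use \(\pi_0\) in the standard error instead. This matches the ‘manual’ calculation in the next section (and the chi-squared test in section 4).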

3.2 Using scipy and numpy

We don’t have to use the statsmodels function; we can do it ‘manually’ with scipy and numpy as shown below:

from scipy import stats
import numpy as np

# Z-statistic: the difference between the observed and null proportions,
# scaled by the standard error under the null hypothesis
z_stat = (pi - pi_0) / np.sqrt(pi_0 * (1 - pi_0) / nobs)
# Two-sided p-value: the probability of a Z-statistic at least this extreme
# in either tail (sf is the survival function, 1 - cdf)
p_value = 2 * stats.norm.sf(abs(z_stat))

print(f'Proportion, π = {pi:.1%}; Z-statistic = {z_stat:.3f}; p = {p_value:.3f}')
## Proportion, π = 34.4%; Z-statistic = -1.768; p = 0.077

4 Comparison with Pearson’s Chi-Squared Goodness-of-Fit Test

An important insight to note is that this p-value is the same as that of Pearson’s chi-squared (pronounced “kai-squared”) goodness-of-fit test:

# Observed frequencies of successes and failures
f_obs = [count, nobs - count]
# Expected frequencies of successes and failures under the null hypothesis
f_exp = [nobs * pi_0, nobs * pi_0]
# Perform a one-way chi-square test
chisq, p = stats.chisquare(f_obs, f_exp)
# Proportion of successful observations
pi = count / nobs

print(f'Proportion, π = {pi:.1%}; chi-squared, χ² = {chisq:.3f}; p = {p:.3f}')
## Proportion, π = 34.4%; chi-squared, χ² = 3.125; p = 0.077

This isn’t a fluke: both tests have the same hypotheses, and with two categories the chi-squared statistic is exactly the square of the Z-statistic, so we would expect the same results!
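
We can confirm this numerically using the Z-statistic from section 3.2:

print(f'Z² = {z_stat**2:.3f}')
## Z² = 3.125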

5 Confidence Interval

To quantify the uncertainty in the sample proportion, use the normal-approximation (Wald) binomial proportion confidence interval. Note that, unlike the test statistic, the interval uses the sample proportion \(\pi\) rather than \(\pi_0\) in the standard error:

# Standard error of the proportion
se = np.sqrt((pi * (1 - pi)) / nobs)
# Significance level
alpha = 0.05
# Percent-point function (aka quantile function) of the normal distribution
z_critical = stats.norm.ppf(1 - (alpha / 2))
# Margin of error
d = z_critical * se
# Confidence interval
ci_lower = pi - d
ci_upper = pi + d

print(f'π = {pi:.1%} ± {d:.1%}')
## π = 34.4% ± 16.5%

or

print(f'π = {pi:.1%}, 95% CI [{ci_lower:.1%}, {ci_upper:.1%}]')
## π = 34.4%, 95% CI [17.9%, 50.8%]
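
statsmodels can compute this interval for us. As a sketch, its proportion_confint function with method='normal' should return the same normal-approximation interval as calculated above:

from statsmodels.stats.proportion import proportion_confint

# Normal-approximation (Wald) confidence interval for a proportion
ci_lower, ci_upper = proportion_confint(count, nobs, alpha=0.05, method='normal')

print(f'95% CI [{ci_lower:.1%}, {ci_upper:.1%}]')
## 95% CI [17.9%, 50.8%]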

6 Interpretation

  • We fail to reject the null hypothesis in favour of the alternative because, at the 5% significance level, the observed data does not provide sufficient evidence against it (p = 0.077 > 0.05).
  • We conclude that the true proportion of students improving their scores could be 50%.
  • When generalising to the target population, we are 95% confident that the interval between 17.9% and 50.8% contains the true underlying proportion of students who improved their mark.
  • We are 95% confident that our sample proportion of 34.4% is within 16.5 pp (percentage points) of the true underlying proportion.

7 Is the Sample Size Too Small?

As a rule of thumb, we want both the number of observed events and non-events to be at least 5. Here’s what happens if we only use a subset of our full dataset:

too_small = df[15:22]

# Number of successes
count = len(too_small[too_small['grade_improved'] == 1])
# Number of observations
nobs = len(too_small)
# Proportion under the null hypothesis
pi_0 = 0.5
# Perform a one-sample Z-test for a proportion
zstat, pvalue = proportions_ztest(count, nobs, value=pi_0, prop_var=pi_0)
# Proportion successful
pi = count / nobs
# Standard error of the proportion
se = np.sqrt((pi * (1 - pi)) / nobs)
# Significance level
alpha = 0.05
# Percent-point function (aka quantile function) of the normal distribution
z_critical = stats.norm.ppf(1 - (alpha / 2))
# Margin of error
d = z_critical * se
# Confidence interval
ci_lower = pi - d
ci_upper = pi + d

print(
    f'Z-statistic = {zstat:.3f}; p = {pvalue:.3f}',
    f'\nProportion, π = {pi:.1%} ± {d:.1%}',
    f'ie 95% CI [{ci_lower:.1%}, {ci_upper:.1%}]'
)
## Z-statistic = -1.134; p = 0.257 
## Proportion, π = 28.6% ± 33.5% ie 95% CI [-4.9%, 62.0%]

Having -4.9% as the lower bound of a 95% confidence interval for a true proportion is clearly nonsense: a proportion cannot be less than zero! We need more data.
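
If collecting more data is not an option, there are alternatives that behave better with small samples. As a sketch (reusing the subset’s count, nobs and pi_0 from above): scipy’s binomtest performs an exact binomial test with no normal approximation, and statsmodels’ proportion_confint with method='wilson' gives the Wilson score interval, which cannot stray outside [0%, 100%]:

from scipy.stats import binomtest
from statsmodels.stats.proportion import proportion_confint

# Exact binomial test: no normal approximation required
result = binomtest(count, nobs, p=pi_0)
# Wilson score interval: respects the [0, 1] bounds of a proportion
wilson_lower, wilson_upper = proportion_confint(count, nobs, alpha=0.05, method='wilson')

print(f'p = {result.pvalue:.3f}; 95% CI [{wilson_lower:.1%}, {wilson_upper:.1%}]')
## p = 0.453; 95% CI [8.2%, 64.1%]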
