

If you have:

  • One binary (nominal) dependent variable, and
  • One binary (nominal) independent variable (ie one that splits your data into two groups/samples)

…then the two-sample Z-test for a proportion might be for you. Consult the following flowchart for a more complete decision-making process:

[Flowchart for choosing a statistical test. The path that leads to the two-sample Z-test: the dependent variable is categorical/discrete with binary (nominal) data, and the independent variable is also binary (nominal).]

This test, like all Z-tests, involves calculating a Z-statistic which, loosely, is the distance between what you have (the difference between the proportions actually seen in your samples) and what random chance would give you (the difference between the proportions that would be seen in perfectly randomly generated samples). Alternatively, the Z-statistic can be thought of as a signal-to-noise ratio: a large value indicates that the difference (between the two samples’ proportions) is large relative to random variation (a difference that could occur by chance).

Under the assumption that Z-statistics are normally distributed, we can then calculate how unlikely it is that your samples would have produced the Z-statistic that they did. More specifically, the Z-statistic is compared with its reference distribution (the standard normal distribution) to return a p-value.
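For reference, the test statistic takes the following pooled form, where p̂₀ and p̂₁ are the sample proportions, x₀ and x₁ the numbers of successes and n₀ and n₁ the sample sizes (this is also the form that the statsmodels function used below calculates by default):

$$z = \frac{\hat{p}_0 - \hat{p}_1}{\sqrt{\bar{p}\,(1 - \bar{p})\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}}, \qquad \bar{p} = \frac{x_0 + x_1}{n_0 + n_1}$$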

The code on this page uses the numpy, scipy and statsmodels packages. These can be installed from the terminal with:

$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install statsmodels

where python3.11 corresponds to the version of Python you have installed and are using.

1 Example Data

As an example we will use the “Spector and Mazzeo (1980) - Program Effectiveness Data” which is included in the statsmodels package as spector (see the statsmodels documentation for more details). This dataset records the test results of students:

import statsmodels.api as sm

# Load the dataset (a dict-like object containing dataframes)
data = sm.datasets.spector.load_pandas()
# Extract the complete dataset as a dataframe
df = data['data']
# Rename
df = df.rename(columns={
    'TUCE': 'test_score',
    'PSI': 'participated',
    'GRADE': 'grade_improved'
})

print(df[15:22])
##      GPA  test_score  participated  grade_improved
## 15  2.74        19.0           0.0             0.0
## 16  2.75        25.0           0.0             0.0
## 17  2.83        19.0           0.0             0.0
## 18  3.12        23.0           1.0             0.0
## 19  3.16        25.0           1.0             1.0
## 20  2.06        22.0           1.0             0.0
## 21  3.62        28.0           1.0             1.0

2 Hypotheses

One of the columns in our dataset is grade_improved, which records whether or not the student’s test scores improved (1 for yes, 0 for no). Another column is participated, which indicates whether that student participated in a programme where they received more personalised teaching (again, 1 for yes, 0 for no). We can hypothesise that the proportion of students whose scores improved differs between those who received the personalised teaching and those who didn’t:

  • H₀: π₀ = π₁
  • H₁: π₀ ≠ π₁

or

  • H₀: π₀ − π₁ = 0
  • H₁: π₀ − π₁ ≠ 0

where π₀ and π₁ are the proportions of students whose grades improved in group 0 (those who did not participate in the programme) and group 1 (those who did), respectively. Stated in words, the hypotheses are:

  • The true proportion of students improving their test score is unrelated to/does not depend on whether or not they participated in the personalized system of instruction programme, against:
  • The true proportion of students improving their test score is related to/depends on whether or not they participated in the personalized system of instruction programme

3 Test the Hypotheses

We test H0 against H1 using a two-sample Z-test for a proportion:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Subset out the two groups
group0 = df[df['participated'] == 0]
group1 = df[df['participated'] == 1]

# Number of successes
count = np.array([
    len(group0[group0['grade_improved'] == 1]),
    len(group1[group1['grade_improved'] == 1])
])
# Number of observations
nobs = np.array([len(group0), len(group1)])
# Perform a two-sample Z-test for a proportion
z_stat, p_value = proportions_ztest(count, nobs)

print(f'Z-statistic = {z_stat:.3f}; p = {p_value:.3f}')
## Z-statistic = -2.391; p = 0.017
# Proportion successful
pi = count / nobs

print(
    f'Proportion of students who improved after not participating in the programme: {pi[0]:.1%}',
    f'\nProportion of students who improved after participating in the programme: {pi[1]:.1%}'
)
## Proportion of students who improved after not participating in the programme: 16.7% 
## Proportion of students who improved after participating in the programme: 57.1%
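As a sanity check, the same numbers can be calculated by hand. This is a sketch assuming the pooled standard error that proportions_ztest uses by default (see the formula above):

from scipy import stats

# Pooled proportion (under the null hypothesis both groups share one underlying proportion)
p_pooled = np.sum(count) / np.sum(nobs)
# Pooled standard error of the difference between the proportions
se_pooled = np.sqrt(p_pooled * (1 - p_pooled) * np.sum(1 / nobs))
# Z-statistic and two-sided p-value
z_manual = (pi[0] - pi[1]) / se_pooled
p_manual = 2 * stats.norm.sf(abs(z_manual))

print(f'Z-statistic = {z_manual:.3f}; p = {p_manual:.3f}')
## Z-statistic = -2.391; p = 0.017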

4 Comparison with Pearson’s Chi-Squared Independence Test

An important insight to note is that this p-value is the same as that of Pearson’s chi-squared (pronounced “kai-squared”) independence test:

from scipy import stats

# Are two distributions independent?
count1 = np.array([
    len(group0[group0['grade_improved'] == 1]),
    len(group1[group1['grade_improved'] == 1])
])
count0 = np.array([
    len(group0[group0['grade_improved'] == 0]),
    len(group1[group1['grade_improved'] == 0])
])
observations = np.array(
    [
        count1,
        count0
    ]
)
row_totals = np.array([np.sum(observations, axis=1)])
col_totals = np.array([np.sum(observations, axis=0)])
n = np.sum(observations)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
# Calculate the chi-squared test statistic
# (stats.chisquare works column-by-column, so sum to get the overall statistic)
chisq, p = stats.chisquare(observations, expected)
chisq = np.sum(chisq)
# Degrees of freedom (named 'dof' so as not to overwrite the dataframe 'df')
rows = observations.shape[0]
cols = observations.shape[1]
dof = (rows - 1) * (cols - 1)

# Convert the chi-squared test statistic to a p-value
p = 1 - stats.chi2.cdf(chisq, dof)

print(f'p = {p}')
## p = 0.01678005766053403
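For the record, scipy can do all of the above in one step with its chi2_contingency function (turning off Yates’s continuity correction, which the Z-test does not use):

# Pearson's chi-squared independence test, done directly
chisq, p, dof, expected = stats.chi2_contingency(observations, correction=False)

print(f'χ² = {chisq:.3f}; p = {p:.3f}')
## χ² = 5.719; p = 0.017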

This isn’t a fluke: for a 2×2 table the chi-squared statistic is exactly the square of the Z-statistic ((−2.391)² ≈ 5.72) and the two tests have the same hypotheses, so we would expect the same results!

5 Confidence Interval

Use the binomial proportion confidence interval, ie the normal-approximation (Wald) interval for the difference between the two proportions:

# Difference of the proportions
diff_pi = pi[1] - pi[0]
# Standard errors of the proportions
se = np.sqrt((pi * (1 - pi)) / nobs)
# Standard error of the difference of the proportions
se_diff_pi = np.sqrt(np.sum(se**2))
# Significance level
alpha = 0.05
# Percent-point function (aka quantile function) of the normal distribution
z_critical = stats.norm.ppf(1 - (alpha / 2))
# Margin of error
d = z_critical * se_diff_pi
# Confidence interval
ci_lower = diff_pi - d
ci_upper = diff_pi + d

print(
    f'π₁ - π₀ = {diff_pi:.1%}, 95% CI [{ci_lower:.1%}, {ci_upper:.1%}]',
    f'\nThe proportion of group 1 that improved is {abs(diff_pi):.1%} ± {d:.1%} greater than that of group 0.'
)
## π₁ - π₀ = 40.5%, 95% CI [9.4%, 71.6%] 
## The proportion of group 1 that improved is 40.5% ± 31.1% greater than that of group 0.
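statsmodels can also calculate this interval with its confint_proportions_2indep function. Here is a sketch assuming the 'wald' method, which should match the manual calculation above (group 1 is passed first because the function returns an interval for the first proportion minus the second):

from statsmodels.stats.proportion import confint_proportions_2indep

# Wald confidence interval for π₁ − π₀
# (this should reproduce the interval computed manually above)
ci_lower, ci_upper = confint_proportions_2indep(
    count[1], nobs[1], count[0], nobs[0], method='wald', compare='diff'
)

print(f'95% CI [{ci_lower:.1%}, {ci_upper:.1%}]')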

6 Interpretation

  • We reject the null hypothesis in favour of the alternative because the observed data provide evidence against it at the 5% significance level (p = 0.017 < 0.05).
  • We conclude that the true proportion of students improving their test score is not the same for both groups. In other words, the proportion of students improving their mark is associated with programme participation.
  • When generalising to the target population, we are 95% confident that the interval between 9.4% and 71.6% contains the true difference between the underlying proportions of improvement between the groups.
  • We are 95% confident that our sample difference of 40.5% is within 31.1 pp (percentage points) of the true difference between underlying proportions.

7 Exact Test

Instead of a Z-test we could consider using an exact test:

# Expected counts
# (the frequencies/proportions we expect to see under the null hypothesis)
pi_overall = np.sum(count) / np.sum(nobs)
expected_counts = pi_overall * nobs

print(expected_counts)
## [6.1875 4.8125]
# The most common exact test for comparing two proportions is Fisher's exact
# test (recommended when any expected count is below 5)
table = [
    [count[0], nobs[0] - count[0]],
    [count[1], nobs[1] - count[1]]
]
odds_ratio, p_value = stats.fisher_exact(table)

print(f"Fisher's exact test: p = {p_value:.3f}")
## Fisher's exact test: p = 0.027
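Fisher’s test conditions on the table’s margins and can be conservative. As an aside, scipy (v1.7+) also provides Barnard’s exact test, an unconditional alternative; a minimal sketch:

# Barnard's exact test: an unconditional (and often less conservative)
# alternative to Fisher's exact test for 2x2 tables
res = stats.barnard_exact(table)

print(f"Barnard's exact test: p = {res.pvalue:.3f}")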

Fisher’s exact test returns a slightly larger p-value than the Z-test (0.027 vs 0.017) but the conclusion at the 5% significance level is unchanged. Remember, though, that a difference must be statistically significant AND sufficiently large to be of practical importance. For example, the Higher Education Statistics Agency (HESA) requires differences between groups under investigation to be:

  • Statistically significant at the 0.3% level, and
  • Larger than 3%
