A chi-squared (χ², pronounced “kai-squared”) test is often used with discrete, categorical data, for example counts of the number of observations that fall into each of a number of categories.
Chi-squared tests are not relevant for continuous data, ie where each measurement/observation in the data set is a ‘value’ (eg a percentage, a height in cm, a weight in kg, etc). Also note that, for a chi-squared test to be valid, there must be at least 5 expected observations in each category. See the Wikipedia page here for more info.
There are three main chi-squared tests (known as Pearson’s chi-squared tests): the goodness-of-fit test, the test of independence and the test of homogeneity.
In Python, all three are performed using the same function: chisquare()
from the SciPy stats package. For more info about this function see its documentation. In the special cases where the dependent variable is binary and the independent variable has one, two or three levels then these three tests give the same answers as the one-, two- and three-sample Z-tests, respectively. Examples are provided that demonstrate this.
A less-common chi-squared test is the “test for trend”. This is not yet implemented in Python but worked examples are nonetheless provided on this page.
All four types of chi-squared test are highlighted in green in the below flowchart, illustrating the decision processes that would lead to each being chosen for use with a particular data set:
Assumptions of a Chi-Squared Test
In addition to the data types and sample sizes needing to be as specified above, there are three other assumptions that need to be met in order for a chi-squared test to be valid:

- The observations must be independent of one another
- The categories must be mutually exclusive, ie each observation falls into exactly one category
- The data must be raw counts (frequencies), not proportions or percentages
Python Packages
The code on this page uses the NumPy, SciPy, pandas, Statsmodels and Matplotlib packages. These can be installed from the terminal with:
# "python3.12" corresponds to the version of Python you have installed and are using
$ python3.12 -m pip install numpy
$ python3.12 -m pip install scipy
$ python3.12 -m pip install pandas
$ python3.12 -m pip install statsmodels
$ python3.12 -m pip install matplotlib
Goodness-of-Fit Test
This test answers the question “does my distribution of numbers come from the distribution I expect?”.
The following example comes from Wikipedia:
If you randomly picked 100 people out of a population and 44 of that sample were men, is that population made up of 50% men or is it skewed?
You would have observed 44 men and 56 women but your expectation would have been 50 men and 50 women. Let’s check to see if we can use a chi-squared test to investigate our question: the data is discrete and categorical (each person is recorded as either a man or a woman), there are at least 5 expected observations of each type, and the observations are independent counts. Thus, the assumptions required for a chi-squared test are met and - as per the flowchart shown above - we could use either a one-sample Z-test or a χ² goodness-of-fit test to investigate this question. We will do both, but let’s start with the latter: testing the ‘goodness-of-fit’ of the sample against the expected distribution:
from scipy import stats as st
# We observed 44 men and 56 women
observed = [44, 56]
# We expected 50 men and 50 women
expected = [50, 50]
# Perform the chi-squared goodness-of-fit test
chisq, p = st.chisquare(observed, expected)
print(f'χ² = {chisq:.2f}, p = {p:.2f}')
## χ² = 1.44, p = 0.23
As you can see, the chisquare()
function takes the numbers that were observed (as a list) for its first argument and the numbers that were expected (as a list) for its second argument. The p-value of 0.23 means that, if the population really were 50% male/female, there would be a 23% chance of drawing a sample at least as skewed as this one. As this is above 0.05, we fail to reject the null hypothesis that the population is 50% male/female.
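As a sanity check, this test statistic can also be calculated by hand using the goodness-of-fit formula (given in full later on this page):

# χ² = Σ (observed - expected)² / expected
chisq_manual = (44 - 50)**2 / 50 + (56 - 50)**2 / 50
print(f'χ² = {chisq_manual:.2f}')
## χ² = 1.44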
As it happens, if you omit the second argument the chisquare()
function assumes you expect equal numbers of each type of observation by default. Therefore, in this example (because we expect equal numbers of men and women) we can actually leave the second argument out:
# We observed 44 men and 56 women
observed = [44, 56]
# Perform the chi-squared goodness-of-fit test
chisq, p = st.chisquare(observed)
print(f'χ² = {chisq:.2f}, p = {p:.2f}')
## χ² = 1.44, p = 0.23
If, in a different sample, we observed exactly 50 men and 50 women, the observed counts would match the expected counts perfectly and the test would return a χ² of 0 and a p-value of 1 (ie the data would provide no evidence at all against the null hypothesis):
# We observed 50 men and 50 women
observed = [50, 50]
# Perform the chi-squared goodness-of-fit test
chisq, p = st.chisquare(observed)
print(f'χ² = {chisq:.2f}, p = {p:.2f}')
## χ² = 0.00, p = 1.00
Note that the test does NOT work properly if you give it your data as proportions or percentages:
observed = [0.44, 0.56]
chisq, p = st.chisquare(observed)
print(f'χ² = {chisq:.2f}, p = {p:.2f}')
## χ² = 0.01, p = 0.90
You must always provide the actual numbers observed and expected.
As mentioned, the flowchart for choosing a statistical test suggests that we can use either a one-sample Z-test or a χ² goodness-of-fit test. The reason for this is that we only have one sample (ie there is no independent variable) and our dependent variable is binary. So let’s have a look at what the proportions_ztest()
function from Statsmodels gives us (see this page for more on the one-sample Z-test):
from statsmodels.stats.proportion import proportions_ztest
# Number of men
count = 44
# Number of observations
nobs = 100
# Proportion under the null hypothesis
pi_0 = 0.5
# Perform a one-sample Z-test for a proportion
z_stat, p_value = proportions_ztest(count, nobs, value=pi_0, prop_var=pi_0)
print(f'Z-statistic = {z_stat:.2f}, p = {p_value:.2f}')
## Z-statistic = -1.20, p = 0.23
So, the Z-statistic is different to the χ² value (-1.20 vs 1.44) but the p-values are the same. In other words, these two tests are different ways to get to the same answer. Of course, that’s only true in the special case where you have a binary dependent variable and no independent variable (ie one sample).
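Indeed, the square of the Z-statistic equals the χ² value:

# The Z-statistic squared equals the chi-squared statistic
print(f'Z² = {z_stat**2:.2f}')
## Z² = 1.44

Our next example will have more than two levels to the dependent variable, so it will not be possible to do it with the one-sample Z-test.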
The following example comes from this Crash Course YouTube video:
It is claimed by a game’s developers that 15% of players choose a ‘healer’ as their character, 20% choose a ‘tank’, 20% choose an ‘assassin’ and 45% choose a ‘fighter’. You do some of your own observations of people playing and you record that 25 choose healers, 35 choose tanks, 50 choose assassins and 90 choose fighters. Does this data support the distribution purported by the developers?
Firstly, total up the number of characters you observed and calculate the number of each type that you would have expected to see if the developers were correct:
import numpy as np
# Numbers you observed
observed = np.array([25, 35, 50, 90])
# Total observed
n = np.sum(observed)
# Distribution purported by the developers
distribution = np.array([0.15, 0.20, 0.20, 0.45])
# Calculate the expected distribution
expected = n * distribution
print(expected)
## [30. 40. 40. 90.]
So here’s our data in a table:
| | Healer | Tank | Assassin | Fighter |
|---|---|---|---|---|
| Observed | 25 | 35 | 50 | 90 |
| Expected | 30 | 40 | 40 | 90 |
Let’s double check that the goodness-of-fit test is appropriate: the data is discrete and categorical, there is one sample with no independent variable, and every expected count is at least 5. So a chi-squared test is appropriate:
# Perform the chi-squared goodness-of-fit test
chisq, p = st.chisquare(observed, expected)
print(f'χ² = {chisq:.3f}, p = {p:.2f}')
## χ² = 3.958, p = 0.27
This matches the answer given in the video.
Note that this example has 3 degrees of freedom. In general, the formula for degrees of freedom is (nrows - 1) * (ncols - 1):
nrows = 2
ncols = 4
dof = (nrows - 1) * (ncols - 1)
print(f'Degrees of freedom = {dof}')
## Degrees of freedom = 3
This example can also be done manually by calculating the chi-square test statistic by hand:
\[\chi^2 = \sum_{i=1}^{n}\frac{\left(observed_i - expected_i\right)^2}{expected_i}\]
chisq = (25 - 30)**2 / 30 + (35 - 40)**2 / 40 + (50 - 40)**2 / 40 + (90 - 90)**2 / 90
print(f'χ² = {chisq:.3f}')
## χ² = 3.958
Then, this test statistic can be converted into a p-value by using the chi-squared distribution with 3 degrees of freedom:
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'p = {p_value:.2f}')
## p = 0.27
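Equivalently, SciPy’s survival function can be used directly:

# The survival function sf() is equivalent to 1 - cdf() but is more precise
# for very small p-values
p_value = st.chi2.sf(chisq, dof)
print(f'p = {p_value:.2f}')
## p = 0.27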
As the p-value is above 0.05 we fail to reject the null hypothesis that the distribution of characters chosen by players is as the developers claimed.
Test of Independence
The chi-squared independence test answers the question “are these two categorical variables independent of each other?”. Data used in a goodness-of-fit test has one categorical variable whereas here we have two, so this is one step more complicated.
Again, using an example from the Crash Course YouTube video:
Is someone’s Hogwarts house independent of whether or not they like pineapple on pizza?
Here’s the raw data as presented in the video:
Observed:
| | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
|---|---|---|---|---|
| No to pineapple on pizza | 79 | 122 | 204 | 74 |
| Yes to pineapple on pizza | 82 | 130 | 240 | 69 |
The above is known as a contingency table or a cross tabulation.
Let’s check the flowchart to confirm that a chi-squared test of independence is appropriate: we have two dependent variables, both of which are categorical (specifically, nominal), and no independent variable (only one sample was taken). So we’re all good.
If the two dependent variables - Hogwarts house and pineapple on pizza preference - are indeed independent we can work out what we would expect the data to look like given the number of people who were surveyed:
#
# Are these two categorical variables independent?
#
# Collate the observed data
observed = np.array(
[
[79, 122, 204, 74],
[82, 130, 240, 69]
]
)
# Calculate totals
row_totals = np.array([np.sum(observed, axis=1)])
col_totals = np.array([np.sum(observed, axis=0)])
n = np.sum(observed)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
print(np.round(expected, 2))
## [[ 77.12 120.71 212.68 68.5 ]
## [ 83.88 131.29 231.32 74.5 ]]
Here it is in a table:
Expected:
| | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
|---|---|---|---|---|
| No to pineapple on pizza | 77.12 | 120.71 | 212.68 | 68.5 |
| Yes to pineapple on pizza | 83.88 | 131.29 | 231.32 | 74.5 |
In Python, if we try to use the chisquare()
function in the same way as we did before it will assume we are trying to do four goodness-of-fit tests and so it will give us four answers:
# Calculate the chi-square test statistic
chisq, p_value = st.chisquare(observed, expected)
print(chisq, p_value)
## [0.08805996 0.02654308 0.67933326 0.84857406] [0.76665816 0.87058106 0.40981641 0.35695604]
So we need to do some additional calculation on our own:
# Sum the answers
chisq = np.sum(chisq)
# Degrees of freedom
rows = observed.shape[0]
cols = observed.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert chi-square test statistic to p-value
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'χ² = {chisq:.3f}, p = {p_value:.2f}')
## χ² = 1.643, p = 0.65
This is the same as what they got in the video (where the p-value is reported rounded to one decimal place, ie as 0.6).
As the p-value is larger than 0.05 we fail to reject the null hypothesis that the distribution of Hogwarts houses is the same for both pineapple on pizza lovers and haters.
A quicker option would have been to create the contingency table of the raw data as a data frame and use the chi2_contingency()
function directly on that:
import pandas as pd
# Create a contingency table (also known as a cross tabulation)
dct = {
'Gryffindor': [79, 82],
'Hufflepuff': [122, 130],
'Ravenclaw': [204, 240],
'Slytherin': [74, 69],
}
crosstab = pd.DataFrame(dct, index=['No', 'Yes'])
print(crosstab)
## Gryffindor Hufflepuff Ravenclaw Slytherin
## No 79 122 204 74
## Yes 82 130 240 69
chi2, pval, dof, expected = st.chi2_contingency(crosstab)
print(f'χ² = {chi2:.3f}, p = {pval:.2f}, degrees of freedom = {dof}')
## χ² = 1.643, p = 0.65, degrees of freedom = 3
This also gives us the expected results under the assumption of independence:
print(expected)
## [[ 77.119 120.708 212.676 68.497]
## [ 83.881 131.292 231.324 74.503]]
Usually in the real world we will get raw data in long format. In this example it has been given to us as a contingency table, but for the sake of demonstration here’s how to reverse-engineer it into long format:
# Reverse-engineer the raw data
s = crosstab.stack()
df = s.index.repeat(s).to_frame(index=False)
columns = {
0: 'Pineapple on pizza?',
1: 'Hogwarts house'
}
df = df.rename(columns=columns)
print(df)
## Pineapple on pizza? Hogwarts house
## 0 No Gryffindor
## 1 No Gryffindor
## 2 No Gryffindor
## 3 No Gryffindor
## 4 No Gryffindor
## .. ... ...
## 995 Yes Slytherin
## 996 Yes Slytherin
## 997 Yes Slytherin
## 998 Yes Slytherin
## 999 Yes Slytherin
##
## [1000 rows x 2 columns]
If this was indeed the format we received the raw data in, we could then convert it into a contingency table and use chi2_contingency()
as above:
table = pd.crosstab(df['Pineapple on pizza?'], df['Hogwarts house'])
chi2, pval, dof, expected = st.chi2_contingency(table)
print(f'χ² = {chi2:.3f}, p = {pval:.2f}')
## χ² = 1.643, p = 0.65
Let’s reduce the example so that we only have two houses:
Observed:
| | Gryffindor | Hufflepuff |
|---|---|---|
| No to pineapple on pizza | 79 | 122 |
| Yes to pineapple on pizza | 82 | 130 |
We can still use the chi-squared independence test:
# Collate the observed data
observed = np.array(
[
[79, 122],
[82, 130]
]
)
# Calculate totals
row_totals = np.array([np.sum(observed, axis=1)])
col_totals = np.array([np.sum(observed, axis=0)])
n = np.sum(observed)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
# Calculate the chi-square test statistic
chisq, p_value = st.chisquare(observed, expected)
# Sum the answers
chisq = np.sum(chisq)
# Degrees of freedom
rows = observed.shape[0]
cols = observed.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert chi-square test statistic to p-value
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'χ² = {chisq:.3f}, p = {p_value:.3f}')
## χ² = 0.017, p = 0.897
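Note in passing that, for a 2×2 table like this one, SciPy’s chi2_contingency() function applies Yates’ continuity correction by default. To reproduce the uncorrected value calculated manually above, the correction needs to be disabled:

# Yates' continuity correction is applied to 2×2 tables by default;
# disable it to match the manual calculation above
chi2, pval, dof, expected = st.chi2_contingency(observed, correction=False)
print(f'χ² = {chi2:.3f}, p = {pval:.3f}')
## χ² = 0.017, p = 0.897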
However, we can also use the two-sample Z-test for a proportion (see more here) to get the same p-value (albeit from a different test statistic). The hypotheses are:

\[H_0: \pi_1 = \pi_2; \quad H_1: \pi_1 \neq \pi_2\]

where \(\pi\) is the proportion who gave a certain response.
from statsmodels.stats.proportion import proportions_ztest
# Number of one type of response
count = [79, 122]
# Number of observations
nobs = [79 + 82, 122 + 130]
# Perform a two-sample Z-test for a proportion
z_stat, p_value = proportions_ztest(count, nobs)
print(f'Z-statistic = {z_stat:.3f}; p = {p_value:.3f}')
## Z-statistic = 0.130; p = 0.897
Note that it doesn’t make a difference mathematically if we swap the variables:
# Number of one type of response
count = [79, 82]
# Number of observations
nobs = [79 + 122, 82 + 130]
# Perform a two-sample Z-test for a proportion
z_stat, p_value = proportions_ztest(count, nobs)
print(f'Z-statistic = {z_stat:.3f}; p = {p_value:.3f}')
## Z-statistic = 0.130; p = 0.897
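Once again, squaring the Z-statistic recovers the χ² value:

# The Z-statistic squared equals the chi-squared statistic
print(f'Z² = {z_stat**2:.3f}')
## Z² = 0.017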
Using an example from Stat Trek:
Do men and women in America have different voting preferences?
The results of a fictitious survey are as follows:
Observed:
| | Republican | Democrat | Independent |
|---|---|---|---|
| Men | 200 | 150 | 50 |
| Women | 250 | 300 | 50 |
Quick check of the flowchart: we have two categorical dependent variables (voting preference and gender, both of which are nominal and the latter is binary) and no independent variable (only one survey was taken). So a chi-squared test of independence is the way to go.
# Collate the observed data
observed = np.array(
[
[200, 150, 50],
[250, 300, 50]
]
)
# Calculate totals
row_totals = np.array([np.sum(observed, axis=1)])
col_totals = np.array([np.sum(observed, axis=0)])
n = np.sum(observed)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
# Calculate the chi-square test statistic
chisq, p_value = st.chisquare(observed, expected)
# Sum the answers
chisq = np.sum(chisq)
# Degrees of freedom
rows = observed.shape[0]
cols = observed.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert chi-square test statistic to p-value
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'χ² = {chisq:.3f}, p = {p_value:.4f}')
## χ² = 16.204, p = 0.0003
To quote the source: “since the p-value (0.0003) is less than the significance level (0.05), we cannot accept the null hypothesis. Thus, we conclude that there is a relationship between gender and voting preference”. Again, the quicker option is to create the contingency table as a data frame and use the chi2_contingency() function directly:
# Create a contingency table (also known as a cross tabulation)
dct = {
'Republican': [200, 250],
'Democrat': [150, 300],
'Independent': [50, 50],
}
crosstab = pd.DataFrame(dct, index=['Men', 'Women'])
# Perform the chi-squared test
chi2, pval, dof, expected = st.chi2_contingency(crosstab)
print(f'χ² = {chi2:.3f}, p = {pval:.4f}')
## χ² = 16.204, p = 0.0003
Test of Homogeneity
The chi-squared homogeneity test answers the question “do these two (or more) samples come from the same population?”.
The homogeneity test is the same as the independence test in practice (ie the way it is calculated in Python is identical) but it is different conceptually: the independence test asks whether two variables are associated with each other within a single sample, whereas the homogeneity test asks whether one variable is distributed identically across two or more samples. Because there is no mathematical difference between these set-ups, the calculation looks the same.
Using an example from DisplayR:
As part of an investigation into the effects of different living conditions, people who were living alone and people who were living with others were both surveyed. One question that was asked is whether or not a difference in living situation affects people’s eating habits.
Here is the recorded data:
Observed:
| | I’m on a diet | I watch what I eat | I eat whatever I feel like |
|---|---|---|---|
| Living with others | 25 | 146 | 124 |
| Living alone | 2 | 23 | 7 |
Check the flowchart: our dependent variable (eating habit) is categorical (specifically, nominal) and we have two samples: people living with others and people living alone. So either a chi-squared homogeneity test or a two-sample Z-test would be appropriate. We’ll use a chi-squared test for this example and a Z-test for the next example.
# Collate the observed data
observed = np.array(
[
[25, 146, 124],
[2, 23, 7]
]
)
# Calculate totals
row_totals = np.array([np.sum(observed, axis=1)])
col_totals = np.array([np.sum(observed, axis=0)])
n = np.sum(observed)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
print(np.round(expected, 1))
## [[ 24.4 152.5 118.2]
## [ 2.6 16.5 12.8]]
Expected:
| | I’m on a diet | I watch what I eat | I eat whatever I feel like |
|---|---|---|---|
| Living with others | 24.4 | 152.5 | 118.2 |
| Living alone | 2.6 | 16.5 | 12.8 |
# Calculate the chi-square test statistic
chisq, p_value = st.chisquare(observed, expected)
# Sum the answers
chisq = np.sum(chisq)
# Degrees of freedom
rows = observed.shape[0]
cols = observed.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert chi-square test statistic to p-value
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'χ² = {chisq:.3f}, p = {p_value:.3f}')
## χ² = 5.900, p = 0.052
To quote the source: “as this is greater than 0.05, by convention the conclusion is that the difference is due to sampling error, although the closeness of 0.05 to 0.052 makes this very much a ‘line ball’ conclusion.”
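Incidentally, the same result can be obtained in a single step with the chi2_contingency() function used earlier:

# The same homogeneity test in one function call
chi2, pval, dof, expected = st.chi2_contingency(observed)
print(f'χ² = {chi2:.3f}, p = {pval:.3f}')
## χ² = 5.900, p = 0.052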
A data set with a binary dependent variable and a nominal independent variable that has three levels can be addressed either with a chi-squared homogeneity test or a three-sample Z-test:
Let’s pretend we have three cohorts of students taking a test and 39, 57 and 54 pass out of total class sizes of 60, 70 and 65. Are the proportions of students passing the same in each cohort?
The dependent variable is binary (a student either passed or failed) and the independent variable is nominal with three levels (there are three cohorts of students). We can use a chi-squared homogeneity test as above:
# Collate the observed data
observed = np.array(
[
[39, 57, 54],
[60 - 39, 70 - 57, 65 - 54]
]
)
# Calculate totals
row_totals = np.array([np.sum(observed, axis=1)])
col_totals = np.array([np.sum(observed, axis=0)])
n = np.sum(observed)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
# Calculate the chi-square test statistic
chisq, p_value = st.chisquare(observed, expected)
# Sum the answers
chisq = np.sum(chisq)
# Degrees of freedom
rows = observed.shape[0]
cols = observed.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert chi-square test statistic to p-value
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'χ² = {chisq:.3f}, p = {p_value:.3f}')
## χ² = 6.992, p = 0.030
Now, let’s treat the same example as if we wanted to do a Z-test:
Following on from this page about the one-sample Z-test for a proportion and this page about the two-sample Z-test for a proportion, it seems natural to want to do an example with three samples:
Our hypotheses are as follows:

\[H_0: \pi_1 = \pi_2 = \pi_3; \quad H_1: \pi_i \neq \pi_j \text{ for at least one pair } i \neq j\]

where \(\pi\) indicates the true proportion of ‘successes’ (passes) for each cohort. Unfortunately, it is not yet possible to answer this question by using a three-sample Z-test with Statsmodels in Python; if we try we get a ‘Not Implemented’ error:
from statsmodels.stats.proportion import proportions_ztest
# Create example data: three cohorts of students take a test and these are the
# numbers that pass:
count = [39, 57, 54]
# While these are the total number in each cohort:
nobs = [60, 70, 65]
z_stat, p_value = proportions_ztest(count, nobs)
## NotImplementedError: more than two samples are not implemented yet
(Note that Fisher’s exact test for three or more groups is likewise not currently implemented in any Python package.)
However, as the Z-test for a proportion is mathematically equivalent to a chi-squared test, we can use proportions_chisquare()
from Statsmodels instead:
from statsmodels.stats.proportion import proportions_chisquare
# Create example data: three cohorts of students take a test and these are the
# numbers that pass in each cohort:
count = [39, 57, 54]
# While these are the total number of students in each cohort:
nobs = [60, 70, 65]
# Three-Sample Chi-Squared Test for a Proportion
chi2_stat, p_value, table = proportions_chisquare(count, nobs)
Note that this takes nobs
- the total number of observations - as the second input. This is different from SciPy’s chisquare()
function which takes the numbers of observations in each individual group as its inputs.
As its outputs, proportions_chisquare()
gives us the tabulation of the number of observations:
print(table[0])
## [[39. 21.]
## [57. 13.]
## [54. 11.]]
…as well as the tabulation of expected observations:
print(table[1])
## [[46.15384615 13.84615385]
## [53.84615385 16.15384615]
## [50. 15. ]]
…and the results:
print(f'χ² = {chi2_stat:.3f}, p = {p_value:.3f}')
## χ² = 6.992, p = 0.030
This low p-value can be understood by looking at the percentage of students who passed in each of the three cohorts:
print(np.round(np.array(count) / np.array(nobs) * 100, 2))
## [65. 81.43 83.08]
While the percentage of students achieving a pass mark was very similar for cohorts 2 and 3 (about 82%) the percentage was significantly lower for cohort 1 (65%). So, we shouldn’t be surprised that the p-value suggests significance: the cohorts appear to be different!
Test for Trend
When a data set has a binary nominal input variable and an ordinal output variable (or vice versa), a chi-squared test for trend can be used. As a reminder: nominal data consists of categories that have no inherent order, whereas ordinal data consists of categories that do have a meaningful order.
The chi-squared test for trend is also known as the “Cochran–Armitage test for trend” (see here for more). It is not yet implemented in Python - neither in SciPy nor in Statsmodels - but it is implemented in R, where it is called prop.trend.test()
(see the documentation here). This means that we can work through an example of this test in Python and then use the R function to check that what we have done is correct.
The first example we’ll use comes from Influential Points. The raw data is given as a contingency table, but let’s go a step backwards and generate this ourselves from a data set in long format to make it more realistic:
# Generate raw data
x = []
y = []
x.extend(['Low'] * 5)
y.extend(['Infected'] * 5)
x.extend(['Medium'] * 14)
y.extend(['Infected'] * 14)
x.extend(['High'] * 18)
y.extend(['Infected'] * 18)
x.extend(['Low'] * 25)
y.extend(['Uninfected'] * 25)
x.extend(['Medium'] * 16)
y.extend(['Uninfected'] * 16)
x.extend(['High'] * 12)
y.extend(['Uninfected'] * 12)
We now have two lists which are each 90 elements long, corresponding to the 90 patients in our fictitious example. Now we can create the contingency table - also known as a cross tabulation - using the crosstab()
function from pandas:
import pandas as pd
# Create a contingency table from the raw data
ct = pd.crosstab(y, x)
print(ct)
## col_0 High Low Medium
## row_0
## Infected 18 5 14
## Uninfected 12 25 16
The columns are not in the same order as on the Influential Points site and the table as a whole is not in the same orientation, but neither of these things matters.
The next thing to define is the scores: the arbitrary rankings (weights) that we attach to each level of the ordinal variable (“High” = 3, “Medium” = 2, “Low” = 1). These need to be in the same order as the columns of the contingency table they refer to (High, Low, Medium):
# Scores
score = [3, 1, 2]
Now we can perform the chi-squared test for trend. The formula used in the example has three intermediate variables: \(T_1\), \(T_2\) and \(T_3\):
import numpy as np
# T1
weighted_ct = ct * score
t_1 = weighted_ct.iloc[0, :].sum()
# T2
weighted_col_sums = weighted_ct.sum()
t_2 = weighted_col_sums.sum()
# T3
square_weighted_ct = ct * np.array(score)**2
square_weighted_col_sums = square_weighted_ct.sum()
t_3 = square_weighted_col_sums.sum()
We also need the row totals and the overall sample size:
# Row totals and overall sample size
row_sums = ct.sum(axis=1)
n = row_sums.sum()
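In symbols - writing \(O_{1j}\) for the counts in the first row (‘Infected’), \(C_j\) for the column totals, \(s_j\) for the scores, \(R_1\) and \(R_2\) for the row totals and \(n\) for the sample size - the quantities calculated above are:

\[T_1 = \sum_j s_j O_{1j}, \quad T_2 = \sum_j s_j C_j, \quad T_3 = \sum_j s_j^2 C_j\]

and the formula for the test statistic is:

\[\chi^2 = \frac{\left(T_1 - R_1 T_2 / n\right)^2}{V} \quad \text{where} \quad V = \frac{R_1 R_2 \left(n T_3 - T_2^2\right)}{n^2 (n - 1)}\]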
These are all the elements we need to plug into their formula for χ²:
# χ²
v = np.prod(row_sums) * (n * t_3 - t_2**2) / (n**2 * (n - 1))
chisq = (t_1 - (row_sums.iloc[0] * t_2 / n))**2 / v
print(f'χ² = {chisq:.2f}')
## χ² = 11.51
This matches their value of 11.51.
The last step is to test this value for significance by calculating the p-value by using the CDF (cumulative distribution function) for the chi-squared distribution:
# Degrees of freedom (the test for trend always has 1 degree of freedom)
dof = 1
# p-value
# The probability of obtaining test results at least as extreme as the result
# actually observed, under the assumption that the null hypothesis is correct
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'p = {p_value:.4f}')
## p = 0.0007
This matches their result of 0.0007.
Using R we get the following:
x <- c(5, 14, 18)
n <- c(30, 30, 30)
prop.trend.test(x, n, c(1, 2, 3))
##
## Chi-squared Test for Trend in Proportions
##
## data: x out of n ,
## using scores: 1 2 3
## X-squared = 11.634, df = 1, p-value = 0.0006474
This is not exactly the same: the difference is consistent with R’s prop.trend.test() omitting the (n - 1) finite-sample correction used in the variance above (11.505 × 90/89 = 11.634). Either way, it’s close enough and the conclusion is identical.
Let’s take an imaginary example where students in various classes were asked if they were satisfied with the feedback they got for an assignment. We could hypothesise that there is an inverse trend between the amount of feedback a teaching assistant has the time to give and the size of the class, so let’s look for this:
# Number of students satisfied with their level of feedback
count = [20, 15, 12, 8]
# Number of students who got feedback
nobs = [30, 27, 23, 23]
# Relative size of class
weights = [1, 2, 3, 4]
Let’s visualise the data:
import matplotlib.pyplot as plt
# Percentage of students satisfied in each class
percent_satisfied = np.array(count) / np.array(nobs) * 100
plt.plot(weights, percent_satisfied)
plt.ylabel('Students Satisfied with their Feedback (%)')
plt.xlabel('Relative Class Size')
plt.show()
Now let’s look for a trend:
# Chi-squared test for trend
count_total = np.sum(count)
nobs_total = np.sum(nobs)
fails_total = nobs_total - count_total
T1 = np.sum(count * np.array(weights))
T2 = np.sum(nobs * np.array(weights))
T3 = np.sum(nobs * np.array(weights)**2)
V = count_total * fails_total * (nobs_total * T3 - T2**2) / (
nobs_total**2 * (nobs_total - 1)
)
chi2_stat = (T1 - (count_total * T2 / nobs_total))**2 / V
p_value = 1 - st.chi2.cdf(chi2_stat, df=1)
print(f'χ² = {chi2_stat:.3f}, p = {p_value:.3f}')
## χ² = 5.026, p = 0.025
The chi-squared test for trend gives a p-value of less than 0.05 so we reject the null hypothesis in favour of the alternative. We conclude that there is evidence that the proportion of students satisfied with assignment feedback is truly related to the size of the class attended. As the size of the class attended increases, the proportion of satisfied students decreases.
This is consistent with the plot of the data which shows a downward trend. However, unlike linear regression, this hypothesis test does not provide an estimate of the constant rate of decrease of the proportion, ie an estimate of the gradient of the plot.
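If an estimate of that gradient is wanted, one option (illustrative only; this is separate from the test for trend itself) is a simple least-squares straight-line fit:

# An illustrative least-squares fit to estimate the gradient of the trend
# (this is separate from the chi-squared test for trend itself)
gradient, intercept = np.polyfit(weights, percent_satisfied, 1)
print(f'Gradient ≈ {gradient:.1f} percentage points per unit of class size')
## Gradient ≈ -9.9 percentage points per unit of class size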
Consider a new example:
A company manufactures widgets. The mass of the widgets that are produced is controlled through batch testing: rather than weighing every single one they take a batch and weigh those instead. If the batch as a whole is too light or too heavy, then all of the widgets in that batch are sent for re-manufacture. The company’s database shows that 90% of all batches are made within the acceptable mass range, while 5% are too heavy and 5% are too light.
If the company were to install new manufacturing equipment they would want to know if they could expect this same distribution of acceptance and rejection. If they tested the first 1000 batches after installing the new equipment and 30 were too light and 60 were too heavy, would this be the case?
The problem above suggests that the observed numbers of too-light, acceptable and too-heavy batches were [30, 910, 60] whereas the expected numbers were [50, 900, 50], respectively.
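These expected numbers follow from applying the historical proportions to the 1000 batches tested:

import numpy as np
# Apply the historical proportions (5% too light, 90% acceptable, 5% too
# heavy) to the 1000 batches tested
print(1000 * np.array([0.05, 0.90, 0.05]))
## [ 50. 900.  50.]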
The null and alternative hypotheses are as follows: under the null hypothesis (H_0) the new equipment produces the same distribution of batch masses as the old equipment (5% too light, 90% acceptable and 5% too heavy), while under the alternative hypothesis (H_1) it produces a different distribution.
def perform_chi_squared(obs, exp):
"""
Perform a chi-squared test and interpret the p-value.
The p-value is the probability of obtaining test results at least as
extreme as the result actually observed, under the assumption that the null
hypothesis is correct.
"""
chisq, p = st.chisquare(obs, exp)
if p <= 0.001:
significance = '***'
interpretation = 'H_0 is rejected at a 99.9% confidence level.'
elif p <= 0.01:
significance = '**'
interpretation = 'H_0 is rejected at a 99% confidence level.'
elif p <= 0.05:
significance = '*'
interpretation = 'H_0 is rejected at a 95% confidence level.'
else:
significance = ''
interpretation = 'H_0 is not rejected'
return p, significance, interpretation
observations = [30, 910, 60]
expected = [50, 900, 50]
p, sig, inter = perform_chi_squared(observations, expected)
print(f'p = {p:.3f}{sig}. {inter}')
## p = 0.006**. H_0 is rejected at a 99% confidence level.
Thus, at a 99% confidence level, we can say that the new equipment produces a different distribution of too-light, acceptable and too-heavy batches compared to the old equipment.