A chi-squared (χ², pronounced “kai-squared”) test is often used with discrete, categorical data, for example counts of the number of observations that fall into each of a number of categories.
Chi-squared tests are not relevant for continuous data, ie where each measurement/observation in the data set is a ‘value’ (eg a percentage, a height in cm, a weight in kg, etc). Also note that, for a chi-squared test to be valid, there must be at least 5 expected observations in each category. See the Wikipedia page here for more info.
There are three main chi-squared tests (known as Pearson’s chi-squared tests): the goodness-of-fit test, the test of independence and the test of homogeneity.
In Python, all three are performed using the same function: chisquare()
from the SciPy stats package. For more info about this function see its documentation. In the special cases where the dependent variable is binary and the independent variable has one, two or three levels then these three tests give the same answers as the one-, two- and three-sample Z-tests, respectively. Examples are provided that demonstrate this.
A less-common chi-squared test is the “test for trend”. This is not yet implemented in Python but worked examples are nonetheless provided on this page.
All four types of chi-squared test are highlighted in green in the below flowchart, illustrating the decision processes that would lead to each being chosen for use with a particular data set:
Assumptions of a Chi-Squared Test
In addition to the data types and sample sizes needing to be as specified above, there are three other assumptions that need to be met in order for a chi-squared test to be valid:

- The observations must be independent of one another
- The categories must be mutually exclusive, ie each observation falls into exactly one category
- The data must be raw counts (frequencies), not proportions or percentages
Python Packages
The code on this page uses the NumPy, SciPy, pandas, Statsmodels and Matplotlib packages. These can be installed from the terminal with:
# "python3.12" corresponds to the version of Python you have installed and are using
$ python3.12 -m pip install numpy
$ python3.12 -m pip install scipy
$ python3.12 -m pip install pandas
$ python3.12 -m pip install statsmodels
$ python3.12 -m pip install matplotlib
Goodness-of-Fit Test
This test answers the question “does my distribution of numbers come from the distribution I expect?”.
The following example comes from Wikipedia:
If you randomly picked 100 people out of a population and 44 of that sample were men, is that population made up of 50% men or is it skewed?
You would have observed 44 men and 56 women but your expectation would have been 50 men and 50 women. Let’s check to see if we can use a chi-squared test to investigate our question: the data is discrete and categorical (each person is recorded as either a man or a woman), there are at least 5 expected observations of each type, and the observations are independent counts. Thus, the assumptions required for a chi-squared test are met and - as per the flowchart shown above - we could use either a one-sample Z-test or a χ² goodness-of-fit test to investigate this question. We will do both, but let’s start with the latter: testing the ‘goodness-of-fit’ of the sample against the expected distribution:
from scipy import stats as st
# We observed 44 men and 56 women
observed = [44, 56]
# We expected 50 men and 50 women
expected = [50, 50]
# Perform the chi-squared goodness-of-fit test
chisq, p = st.chisquare(observed, expected)
print(f'χ² = {chisq:.2f}, p = {p:.2f}')
## χ² = 1.44, p = 0.23
As you can see, the chisquare()
function takes the numbers that were observed (as a list) for its first argument and the numbers that were expected (as a list) for its second argument. The p-value of 0.23 means that, if the population really were 50% male/female, there would be a 23% chance of drawing a sample at least as skewed as this one. As this is above 0.05, we fail to reject the null hypothesis that the population is 50% male/female.
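As a sanity check, this test statistic can also be calculated by hand using the goodness-of-fit formula (given in full later on this page):

# χ² = Σ (observed - expected)² / expected
chisq_manual = (44 - 50)**2 / 50 + (56 - 50)**2 / 50
print(f'χ² = {chisq_manual:.2f}')
## χ² = 1.44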
As it happens, if you omit the second argument the chisquare()
function assumes you expect equal numbers of each type of observation by default. Therefore, in this example (because we expect equal numbers of men and women) we can actually leave the second argument out:
# We observed 44 men and 56 women
observed = [44, 56]
# Perform the chi-squared goodness-of-fit test
chisq, p = st.chisquare(observed)
print(f'χ² = {chisq:.2f}, p = {p:.2f}')
## χ² = 1.44, p = 0.23
If, in a different sample, we observed exactly 50 men and 50 women, the observed counts would match the expected counts perfectly and the test would return a χ² of 0 and a p-value of 1 (ie the data would provide no evidence at all against the null hypothesis):
# We observed 50 men and 50 women
observed = [50, 50]
# Perform the chi-squared goodness-of-fit test
chisq, p = st.chisquare(observed)
print(f'χ² = {chisq:.2f}, p = {p:.2f}')
## χ² = 0.00, p = 1.00
Note that the test does NOT work properly if you give it your data as proportions or percentages:
observed = [0.44, 0.56]
chisq, p = st.chisquare(observed)
print(f'χ² = {chisq:.2f}, p = {p:.2f}')
## χ² = 0.01, p = 0.90
You must always provide the actual numbers observed and expected.
As mentioned, the flowchart for choosing a statistical test suggests that we can use either a one-sample Z-test or a χ² goodness-of-fit test. The reason for this is that we only have one sample (ie there is no independent variable) and our dependent variable is binary. So let’s have a look at what the proportions_ztest()
function from Statsmodels gives us (see this page for more on the one-sample Z-test):
from statsmodels.stats.proportion import proportions_ztest
# Number of men
count = 44
# Number of observations
nobs = 100
# Proportion under the null hypothesis
pi_0 = 0.5
# Perform a one-sample Z-test for a proportion
z_stat, p_value = proportions_ztest(count, nobs, value=pi_0, prop_var=pi_0)
print(f'Z-statistic = {z_stat:.2f}, p = {p_value:.2f}')
## Z-statistic = -1.20, p = 0.23
So, the Z-statistic is different to the χ² value (-1.20 vs 1.44) but the p-values are the same. In other words, these two tests are different ways to get to the same answer. Of course, that’s only true in the special case where you have a binary dependent variable and no independent variable (ie one sample).
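Indeed, the square of the Z-statistic equals the χ² value:

# The Z-statistic squared equals the chi-squared statistic
print(f'Z² = {z_stat**2:.2f}')
## Z² = 1.44

Our next example will have more than two levels to the dependent variable, so it will not be possible to do it with the one-sample Z-test.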
The following example comes from this Crash Course YouTube video:
It is claimed by a game’s developers that 15% of players choose a ‘healer’ as their character, 20% choose a ‘tank’, 20% choose an ‘assassin’ and 45% choose a ‘fighter’. You do some of your own observations of people playing and you record that 25 choose healers, 35 choose tanks, 50 choose assassins and 90 choose fighters. Does this data support the distribution purported by the developers?
Firstly, total up the number of characters you observed and calculate the number of each type that you would have expected to see if the developers were correct:
import numpy as np
# Numbers you observed
observed = np.array([25, 35, 50, 90])
# Total observed
n = np.sum(observed)
# Distribution purported by the developers
distribution = np.array([0.15, 0.20, 0.20, 0.45])
# Calculate the expected distribution
expected = n * distribution
print(expected)
## [30. 40. 40. 90.]
So here’s our data in a table:
| | Healer | Tank | Assassin | Fighter |
|---|---|---|---|---|
| Observed | 25 | 35 | 50 | 90 |
| Expected | 30 | 40 | 40 | 90 |
Let’s double check that the goodness-of-fit test is appropriate: the data is discrete and categorical, there is one sample with no independent variable, and every expected count is at least 5. So a chi-squared test is appropriate:
# Perform the chi-squared goodness-of-fit test
chisq, p = st.chisquare(observed, expected)
print(f'χ² = {chisq:.3f}, p = {p:.2f}')
## χ² = 3.958, p = 0.27
This matches the answer given in the video.
Note that this example has 3 degrees of freedom. In general, the formula for degrees of freedom is (nrows - 1) * (ncols - 1):
nrows = 2
ncols = 4
dof = (nrows - 1) * (ncols - 1)
print(f'Degrees of freedom = {dof}')
## Degrees of freedom = 3
This example can also be done manually by calculating the chi-square test statistic by hand:
\[\chi^2 = \sum_{i=1}^{n}\frac{\left(observed_i - expected_i\right)^2}{expected_i}\]
chisq = (25 - 30)**2 / 30 + (35 - 40)**2 / 40 + (50 - 40)**2 / 40 + (90 - 90)**2 / 90
print(f'χ² = {chisq:.3f}')
## χ² = 3.958
Then, this test statistic can be converted into a p-value by using the chi-squared distribution with 3 degrees of freedom:
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'p = {p_value:.2f}')
## p = 0.27
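Equivalently, SciPy’s survival function can be used directly:

# The survival function sf() is equivalent to 1 - cdf() but is more precise
# for very small p-values
p_value = st.chi2.sf(chisq, dof)
print(f'p = {p_value:.2f}')
## p = 0.27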
As the p-value is above 0.05 we fail to reject the null hypothesis that the distribution of characters chosen by players is as the developers claimed.
Test of Independence
The chi-squared independence test answers the question “are these two categorical variables independent of each other?”. Data used in a goodness-of-fit test has one categorical variable whereas here we have two, so this is one step more complicated.
Again, using an example from the Crash Course YouTube video:
Is someone’s Hogwarts house independent of whether or not they like pineapple on pizza?
Here’s the raw data as presented in the video:
Observed:
| | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
|---|---|---|---|---|
| No to pineapple on pizza | 79 | 122 | 204 | 74 |
| Yes to pineapple on pizza | 82 | 130 | 240 | 69 |
The above is known as a contingency table or a cross tabulation.
Let’s check the flowchart to confirm that a chi-squared test of independence is appropriate: we have two dependent variables, both of which are categorical (specifically, nominal), and no independent variable (only one sample was taken). So we’re all good.
If the two dependent variables - Hogwarts house and pineapple on pizza preference - are indeed independent we can work out what we would expect the data to look like given the number of people who were surveyed:
#
# Are these two categorical variables independent?
#
# Collate the observed data
observed = np.array(
[
[79, 122, 204, 74],
[82, 130, 240, 69]
]
)
# Calculate totals
row_totals = np.array([np.sum(observed, axis=1)])
col_totals = np.array([np.sum(observed, axis=0)])
n = np.sum(observed)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
print(np.round(expected, 2))
## [[ 77.12 120.71 212.68 68.5 ]
## [ 83.88 131.29 231.32 74.5 ]]
Here it is in a table:
Expected:
| | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
|---|---|---|---|---|
| No to pineapple on pizza | 77.12 | 120.71 | 212.68 | 68.5 |
| Yes to pineapple on pizza | 83.88 | 131.29 | 231.32 | 74.5 |
In Python, if we try to use the chisquare()
function in the same way as we did before it will assume we are trying to do four goodness-of-fit tests and so it will give us four answers:
# Calculate the chi-square test statistic
chisq, p_value = st.chisquare(observed, expected)
print(chisq, p_value)
## [0.08805996 0.02654308 0.67933326 0.84857406] [0.76665816 0.87058106 0.40981641 0.35695604]
So we need to do some additional calculation on our own:
# Sum the answers
chisq = np.sum(chisq)
# Degrees of freedom
rows = observed.shape[0]
cols = observed.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert chi-square test statistic to p-value
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'χ² = {chisq:.3f}, p = {p_value:.2f}')
## χ² = 1.643, p = 0.65
This is the same as what they got in the video (where the p-value is reported rounded to one decimal place, ie as 0.6).
As the p-value is larger than 0.05 we fail to reject the null hypothesis that the distribution of Hogwarts houses is the same for both pineapple on pizza lovers and haters.
A quicker option would have been to create the contingency table of the raw data as a data frame and use the chi2_contingency()
function directly on that:
import pandas as pd
# Create a contingency table (also known as a cross tabulation)
dct = {
'Gryffindor': [79, 82],
'Hufflepuff': [122, 130],
'Ravenclaw': [204, 240],
'Slytherin': [74, 69],
}
crosstab = pd.DataFrame(dct, index=['No', 'Yes'])
print(crosstab)
## Gryffindor Hufflepuff Ravenclaw Slytherin
## No 79 122 204 74
## Yes 82 130 240 69
chi2, pval, dof, expected = st.chi2_contingency(crosstab)
print(f'χ² = {chi2:.3f}, p = {pval:.2f}, degrees of freedom = {dof}')
## χ² = 1.643, p = 0.65, degrees of freedom = 3
This also gives us the expected results under the assumption of independence:
print(expected)
## [[ 77.119 120.708 212.676 68.497]
## [ 83.881 131.292 231.324 74.503]]
Usually in the real world we will get raw data in long format. In this example it has been given to us as a contingency table, but for the sake of demonstration here’s how to reverse-engineer it into long format:
# Reverse-engineer the raw data
s = crosstab.stack()
df = s.index.repeat(s).to_frame(index=False)
columns = {
0: 'Pineapple on pizza?',
1: 'Hogwarts house'
}
df = df.rename(columns=columns)
print(df)
## Pineapple on pizza? Hogwarts house
## 0 No Gryffindor
## 1 No Gryffindor
## 2 No Gryffindor
## 3 No Gryffindor
## 4 No Gryffindor
## .. ... ...
## 995 Yes Slytherin
## 996 Yes Slytherin
## 997 Yes Slytherin
## 998 Yes Slytherin
## 999 Yes Slytherin
##
## [1000 rows x 2 columns]
If this was indeed the format we received the raw data in, we could then convert it into a contingency table and use chi2_contingency()
as above:
table = pd.crosstab(df['Pineapple on pizza?'], df['Hogwarts house'])
chi2, pval, dof, expected = st.chi2_contingency(table)
print(f'χ² = {chi2:.3f}, p = {pval:.2f}')
## χ² = 1.643, p = 0.65
Let’s reduce the example so that we only have two houses:
Observed:
| | Gryffindor | Hufflepuff |
|---|---|---|
| No to pineapple on pizza | 79 | 122 |
| Yes to pineapple on pizza | 82 | 130 |
We can still use the chi-squared independence test:
# Collate the observed data
observed = np.array(
[
[79, 122],
[82, 130]
]
)
# Calculate totals
row_totals = np.array([np.sum(observed, axis=1)])
col_totals = np.array([np.sum(observed, axis=0)])
n = np.sum(observed)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
# Calculate the chi-square test statistic
chisq, p_value = st.chisquare(observed, expected)
# Sum the answers
chisq = np.sum(chisq)
# Degrees of freedom
rows = observed.shape[0]
cols = observed.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert chi-square test statistic to p-value
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'χ² = {chisq:.3f}, p = {p_value:.3f}')
## χ² = 0.017, p = 0.897
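Note in passing that, for a 2×2 table like this one, SciPy’s chi2_contingency() function applies Yates’ continuity correction by default. To reproduce the uncorrected value calculated manually above, the correction needs to be disabled:

# Yates' continuity correction is applied to 2×2 tables by default;
# disable it to match the manual calculation above
chi2, pval, dof, expected = st.chi2_contingency(observed, correction=False)
print(f'χ² = {chi2:.3f}, p = {pval:.3f}')
## χ² = 0.017, p = 0.897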
However, we can also use the two-sample Z-test for a proportion (see more here) to get the same p-value (albeit from a different test statistic). The hypotheses are:

\[H_0: \pi_1 = \pi_2; \quad H_1: \pi_1 \neq \pi_2\]

where \(\pi\) is the proportion who gave a certain response.
from statsmodels.stats.proportion import proportions_ztest
# Number of one type of response
count = [79, 122]
# Number of observations
nobs = [79 + 82, 122 + 130]
# Perform a two-sample Z-test for a proportion
z_stat, p_value = proportions_ztest(count, nobs)
print(f'Z-statistic = {z_stat:.3f}; p = {p_value:.3f}')
## Z-statistic = 0.130; p = 0.897
Note that it doesn’t make a difference mathematically if we swap the variables:
# Number of one type of response
count = [79, 82]
# Number of observations
nobs = [79 + 122, 82 + 130]
# Perform a two-sample Z-test for a proportion
z_stat, p_value = proportions_ztest(count, nobs)
print(f'Z-statistic = {z_stat:.3f}; p = {p_value:.3f}')
## Z-statistic = 0.130; p = 0.897
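Once again, squaring the Z-statistic recovers the χ² value:

# The Z-statistic squared equals the chi-squared statistic
print(f'Z² = {z_stat**2:.3f}')
## Z² = 0.017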
Using an example from Stat Trek:
Do men and women in America have different voting preferences?
The results of a fictitious survey are as follows:
Observed:
| | Republican | Democrat | Independent |
|---|---|---|---|
| Men | 200 | 150 | 50 |
| Women | 250 | 300 | 50 |
Quick check of the flowchart: we have two categorical dependent variables (voting preference and gender, both of which are nominal and the latter is binary) and no independent variable (only one survey was taken). So a chi-squared test of independence is the way to go.
# Collate the observed data
observed = np.array(
[
[200, 150, 50],
[250, 300, 50]
]
)
# Calculate totals
row_totals = np.array([np.sum(observed, axis=1)])
col_totals = np.array([np.sum(observed, axis=0)])
n = np.sum(observed)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
# Calculate the chi-square test statistic
chisq, p_value = st.chisquare(observed, expected)
# Sum the answers
chisq = np.sum(chisq)
# Degrees of freedom
rows = observed.shape[0]
cols = observed.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert chi-square test statistic to p-value
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'χ² = {chisq:.3f}, p = {p_value:.4f}')
## χ² = 16.204, p = 0.0003
To quote the source: “since the p-value (0.0003) is less than the significance level (0.05), we cannot accept the null hypothesis. Thus, we conclude that there is a relationship between gender and voting preference”. Again, the quicker option is to create the contingency table as a data frame and use the chi2_contingency() function directly:
# Create a contingency table (also known as a cross tabulation)
dct = {
'Republican': [200, 250],
'Democrat': [150, 300],
'Independent': [50, 50],
}
crosstab = pd.DataFrame(dct, index=['Men', 'Women'])
# Perform the chi-squared test
chi2, pval, dof, expected = st.chi2_contingency(crosstab)
print(f'χ² = {chi2:.3f}, p = {pval:.4f}')
## χ² = 16.204, p = 0.0003
Test of Homogeneity
The chi-squared homogeneity test answers the question “do these two (or more) samples come from the same population?”.
The homogeneity test is the same as the independence test in practice (ie the way it is calculated in Python is identical) but it is different conceptually: the independence test asks whether two variables are associated with each other within a single sample, whereas the homogeneity test asks whether one variable is distributed identically across two or more samples. Because there is no mathematical difference between these set-ups, the calculation looks the same.
Using an example from DisplayR:
As part of an investigation into the effects of different living conditions, people who were living alone and people who were living with others were both surveyed. One question that was asked is whether or not a difference in living situation affects people’s eating habits.
Here is the recorded data:
Observed:
| | I’m on a diet | I watch what I eat | I eat whatever I feel like |
|---|---|---|---|
| Living with others | 25 | 146 | 124 |
| Living alone | 2 | 23 | 7 |
Check the flowchart: our dependent variable (eating habit) is categorical (specifically, nominal) and we have two samples: people living with others and people living alone. So either a chi-squared homogeneity test or a two-sample Z-test would be appropriate. We’ll use a chi-squared test for this example and a Z-test for the next example.
# Collate the observed data
observed = np.array(
[
[25, 146, 124],
[2, 23, 7]
]
)
# Calculate totals
row_totals = np.array([np.sum(observed, axis=1)])
col_totals = np.array([np.sum(observed, axis=0)])
n = np.sum(observed)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
print(np.round(expected, 1))
## [[ 24.4 152.5 118.2]
## [ 2.6 16.5 12.8]]
Expected:
| | I’m on a diet | I watch what I eat | I eat whatever I feel like |
|---|---|---|---|
| Living with others | 24.4 | 152.5 | 118.2 |
| Living alone | 2.6 | 16.5 | 12.8 |
# Calculate the chi-square test statistic
chisq, p_value = st.chisquare(observed, expected)
# Sum the answers
chisq = np.sum(chisq)
# Degrees of freedom
rows = observed.shape[0]
cols = observed.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert chi-square test statistic to p-value
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'χ² = {chisq:.3f}, p = {p_value:.3f}')
## χ² = 5.900, p = 0.052
To quote the source: “as this is greater than 0.05, by convention the conclusion is that the difference is due to sampling error, although the closeness of 0.05 to 0.052 makes this very much a ‘line ball’ conclusion.”
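Incidentally, the same result can be obtained in a single step with the chi2_contingency() function used earlier:

# The same homogeneity test in one function call
chi2, pval, dof, expected = st.chi2_contingency(observed)
print(f'χ² = {chi2:.3f}, p = {pval:.3f}')
## χ² = 5.900, p = 0.052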
A data set with a binary dependent variable and a nominal independent variable that has three levels can be addressed either with a chi-squared homogeneity test or a three-sample Z-test:
Let’s pretend we have three cohorts of students taking a test and 39, 57 and 54 pass out of total class sizes of 60, 70 and 65. Are the proportions of students passing the same in each cohort?
The dependent variable is binary (a student either passed or failed) and the independent variable is nominal with three levels (there are three cohorts of students). We can use a chi-squared homogeneity test as above:
# Collate the observed data
observed = np.array(
[
[39, 57, 54],
[60 - 39, 70 - 57, 65 - 54]
]
)
# Calculate totals
row_totals = np.array([np.sum(observed, axis=1)])
col_totals = np.array([np.sum(observed, axis=0)])
n = np.sum(observed)
# Calculate the expected observations
expected = np.dot(row_totals.T, col_totals) / n
# Calculate the chi-square test statistic
chisq, p_value = st.chisquare(observed, expected)
# Sum the answers
chisq = np.sum(chisq)
# Degrees of freedom
rows = observed.shape[0]
cols = observed.shape[1]
dof = (rows - 1) * (cols - 1)
# Convert chi-square test statistic to p-value
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'χ² = {chisq:.3f}, p = {p_value:.3f}')
## χ² = 6.992, p = 0.030
Now, let’s treat the same example as if we wanted to do a Z-test:
Following on from this page about the one-sample Z-test for a proportion and this page about the two-sample Z-test for a proportion, it seems natural to want to do an example with three samples:
Our hypotheses are as follows:

\[H_0: \pi_1 = \pi_2 = \pi_3; \quad H_1: \pi_i \neq \pi_j \text{ for at least one pair } i \neq j\]

where \(\pi\) indicates the true proportion of ‘successes’ (passes) for each cohort. Unfortunately, it is not yet possible to answer this question by using a three-sample Z-test with Statsmodels in Python; if we try we get a ‘Not Implemented’ error:
from statsmodels.stats.proportion import proportions_ztest
# Create example data: three cohorts of students take a test and these are the
# numbers that pass:
count = [39, 57, 54]
# While these are the total number in each cohort:
nobs = [60, 70, 65]
z_stat, p_value = proportions_ztest(count, nobs)
## NotImplementedError: more than two samples are not implemented yet
(Note that Fisher’s exact test for three or more groups is likewise not currently implemented in any Python package.)
However, as the Z-test for a proportion is mathematically equivalent to a chi-squared test, we can use proportions_chisquare()
from Statsmodels instead:
from statsmodels.stats.proportion import proportions_chisquare
# Create example data: three cohorts of students take a test and these are the
# numbers that pass in each cohort:
count = [39, 57, 54]
# While these are the total number of students in each cohort:
nobs = [60, 70, 65]
# Three-Sample Chi-Squared Test for a Proportion
chi2_stat, p_value, table = proportions_chisquare(count, nobs)
Note that this takes nobs
- the total number of observations - as the second input. This is different from SciPy’s chisquare()
function which takes the numbers of observations in each individual group as its inputs.
As its outputs, proportions_chisquare()
gives us the tabulation of the number of observations:
print(table[0])
## [[39. 21.]
## [57. 13.]
## [54. 11.]]
…as well as the tabulation of expected observations:
print(table[1])
## [[46.15384615 13.84615385]
## [53.84615385 16.15384615]
## [50. 15. ]]
…and the results:
print(f'χ² = {chi2_stat:.3f}, p = {p_value:.3f}')
## χ² = 6.992, p = 0.030
This low p-value can be understood by looking at the percentage of students who passed in each of the three cohorts:
print(np.round(np.array(count) / np.array(nobs) * 100, 2))
## [65. 81.43 83.08]
While the percentage of students achieving a pass mark was very similar for cohorts 2 and 3 (about 82%) the percentage was significantly lower for cohort 1 (65%). So, we shouldn’t be surprised that the p-value suggests significance: the cohorts appear to be different!
Test for Trend
When a data set has a binary nominal input variable and an ordinal output variable (or vice versa), a chi-squared test for trend can be used. As a reminder: nominal data consists of categories that have no inherent order, whereas ordinal data consists of categories that do have a meaningful order.
The chi-squared test for trend is also known as the “Cochran–Armitage test for trend” (see here for more). It is not yet implemented in Python - neither in SciPy nor in Statsmodels - but it is implemented in R, where it is called prop.trend.test()
(see the documentation here). This means that we can work through an example of this test in Python and then use the R function to check that what we have done is correct.
The first example we’ll use comes from Influential Points. The raw data is given as a contingency table, but let’s go a step backwards and generate this ourselves from a data set in long format to make it more realistic:
# Generate raw data
x = []
y = []
x.extend(['Low'] * 5)
y.extend(['Infected'] * 5)
x.extend(['Medium'] * 14)
y.extend(['Infected'] * 14)
x.extend(['High'] * 18)
y.extend(['Infected'] * 18)
x.extend(['Low'] * 25)
y.extend(['Uninfected'] * 25)
x.extend(['Medium'] * 16)
y.extend(['Uninfected'] * 16)
x.extend(['High'] * 12)
y.extend(['Uninfected'] * 12)
We now have two lists which are each 90 elements long, corresponding to the 90 patients in our fictitious example. Now we can create the contingency table - also known as a cross tabulation - using the crosstab()
function from pandas:
import pandas as pd
# Create a contingency table from the raw data
ct = pd.crosstab(y, x)
print(ct)
## col_0 High Low Medium
## row_0
## Infected 18 5 14
## Uninfected 12 25 16
The columns are not in the same order as on the Influential Points site and the table as a whole is not in the same orientation, but neither of these things matters.
The next thing to define is the scores: the arbitrary rankings (weights) that we attach to each level of the ordinal variable (“High” = 3, “Medium” = 2, “Low” = 1). These need to be in the same order as the columns of the contingency table they refer to (High, Low, Medium):
# Scores
score = [3, 1, 2]
Now we can perform the chi-squared test for trend. The formula used in the example has three intermediate variables: \(T_1\), \(T_2\) and \(T_3\):
import numpy as np
# T1
weighted_ct = ct * score
t_1 = weighted_ct.iloc[0, :].sum()
# T2
weighted_col_sums = weighted_ct.sum()
t_2 = weighted_col_sums.sum()
# T3
square_weighted_ct = ct * np.array(score)**2
square_weighted_col_sums = square_weighted_ct.sum()
t_3 = square_weighted_col_sums.sum()
We also need the row totals and the overall sample size:
# Row totals and overall sample size
row_sums = ct.sum(axis=1)
n = row_sums.sum()
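In symbols - writing \(O_{1j}\) for the counts in the first row (‘Infected’), \(C_j\) for the column totals, \(s_j\) for the scores, \(R_1\) and \(R_2\) for the row totals and \(n\) for the sample size - the quantities calculated above are:

\[T_1 = \sum_j s_j O_{1j}, \quad T_2 = \sum_j s_j C_j, \quad T_3 = \sum_j s_j^2 C_j\]

and the formula for the test statistic is:

\[\chi^2 = \frac{\left(T_1 - R_1 T_2 / n\right)^2}{V} \quad \text{where} \quad V = \frac{R_1 R_2 \left(n T_3 - T_2^2\right)}{n^2 (n - 1)}\]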
These are all the elements we need to plug into their formula for χ²:
# χ²
v = np.prod(row_sums) * (n * t_3 - t_2**2) / (n**2 * (n - 1))
chisq = (t_1 - (row_sums.iloc[0] * t_2 / n))**2 / v
print(f'χ² = {chisq:.2f}')
## χ² = 11.51
This matches their value of 11.51.
The last step is to test this value for significance by calculating the p-value by using the CDF (cumulative distribution function) for the chi-squared distribution:
# Degrees of freedom (the test for trend always has 1 degree of freedom)
dof = 1
# p-value
# The probability of obtaining test results at least as extreme as the result
# actually observed, under the assumption that the null hypothesis is correct
p_value = 1 - st.chi2.cdf(chisq, dof)
print(f'p = {p_value:.4f}')
## p = 0.0007
This matches their result of 0.0007.
Using R we get the following:
x <- c(5, 14, 18)
n <- c(30, 30, 30)
prop.trend.test(x, n, c(1, 2, 3))
##
## Chi-squared Test for Trend in Proportions
##
## data: x out of n ,
## using scores: 1 2 3
## X-squared = 11.634, df = 1, p-value = 0.0006474
This is not exactly the same: the difference is consistent with R’s prop.trend.test() omitting the (n - 1) finite-sample correction used in the variance above (11.505 × 90/89 = 11.634). Either way, it’s close enough and the conclusion is identical.
Let’s take an imaginary example where students in various classes were asked if they were satisfied with the feedback they got for an assignment. We could hypothesise that there is an inverse trend between the amount of feedback a teaching assistant has the time to give and the size of the class, so let’s look for this:
# Number of students satisfied with their level of feedback
count = [20, 15, 12, 8]
# Number of students who got feedback
nobs = [30, 27, 23, 23]
# Relative size of class
weights = [1, 2, 3, 4]
Let’s visualise the data:
import matplotlib.pyplot as plt
# Percentage of students satisfied in each class
percent_satisfied = np.array(count) / np.array(nobs) * 100
plt.plot(weights, percent_satisfied)
plt.ylabel('Students Satisfied with their Feedback (%)')
plt.xlabel('Relative Class Size')
plt.show()
Now let’s look for a trend:
# Chi-squared test for trend
count_total = np.sum(count)
nobs_total = np.sum(nobs)
fails_total = nobs_total - count_total
T1 = np.sum(count * np.array(weights))
T2 = np.sum(nobs * np.array(weights))
T3 = np.sum(nobs * np.array(weights)**2)
V = count_total * fails_total * (nobs_total * T3 - T2**2) / (
nobs_total**2 * (nobs_total - 1)
)
chi2_stat = (T1 - (count_total * T2 / nobs_total))**2 / V
p_value = 1 - st.chi2.cdf(chi2_stat, df=1)
print(f'χ² = {chi2_stat:.3f}, p = {p_value:.3f}')
## χ² = 5.026, p = 0.025
The chi-squared test for trend gives a p-value of less than 0.05 so we reject the null hypothesis in favour of the alternative. We conclude that there is evidence that the proportion of students satisfied with assignment feedback is truly related to the size of the class attended. As the size of the class attended increases, the proportion of satisfied students decreases.
This is consistent with the plot of the data which shows a downward trend. However, unlike linear regression, this hypothesis test does not provide an estimate of the constant rate of decrease of the proportion, ie an estimate of the gradient of the plot.
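If an estimate of that gradient is wanted, one option (illustrative only; this is separate from the test for trend itself) is a simple least-squares straight-line fit:

# An illustrative least-squares fit to estimate the gradient of the trend
# (this is separate from the chi-squared test for trend itself)
gradient, intercept = np.polyfit(weights, percent_satisfied, 1)
print(f'Gradient ≈ {gradient:.1f} percentage points per unit of class size')
## Gradient ≈ -9.9 percentage points per unit of class size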
Consider a new example:
A company manufactures widgets. The mass of the widgets that are produced is controlled through batch testing: rather than weighing every single one they take a batch and weigh those instead. If the batch as a whole is too light or too heavy, then all of the widgets in that batch are sent for re-manufacture. The company’s database shows that 90% of all batches are made within the acceptable mass range, while 5% are too heavy and 5% are too light.
If the company were to install new manufacturing equipment they would want to know if they could expect this same distribution of acceptance and rejection. If they tested the first 1000 batches after installing the new equipment and 30 were too light and 60 were too heavy, would this be the case?
The problem above suggests that the observed numbers of too-light, acceptable and too-heavy batches were [30, 910, 60] whereas the expected numbers were [50, 900, 50], respectively.
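These expected numbers follow from applying the historical proportions to the 1000 batches tested:

import numpy as np
# Apply the historical proportions (5% too light, 90% acceptable, 5% too
# heavy) to the 1000 batches tested
print(1000 * np.array([0.05, 0.90, 0.05]))
## [ 50. 900.  50.]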
The null and alternative hypotheses are as follows: under the null hypothesis (H_0) the new equipment produces the same distribution of batch masses as the old equipment (5% too light, 90% acceptable and 5% too heavy), while under the alternative hypothesis (H_1) it produces a different distribution.
def perform_chi_squared(obs, exp):
"""
Perform a chi-squared test and interpret the p-value.
The p-value is the probability of obtaining test results at least as
extreme as the result actually observed, under the assumption that the null
hypothesis is correct.
"""
chisq, p = st.chisquare(obs, exp)
if p <= 0.001:
significance = '***'
interpretation = 'H_0 is rejected at a 99.9% confidence level.'
elif p <= 0.01:
significance = '**'
interpretation = 'H_0 is rejected at a 99% confidence level.'
elif p <= 0.05:
significance = '*'
interpretation = 'H_0 is rejected at a 95% confidence level.'
else:
significance = ''
interpretation = 'H_0 is not rejected'
return p, significance, interpretation
observations = [30, 910, 60]
expected = [50, 900, 50]
p, sig, inter = perform_chi_squared(observations, expected)
print(f'p = {p:.3f}{sig}. {inter}')
## p = 0.006**. H_0 is rejected at a 99% confidence level.
Thus, at a 99% confidence level, we can say that the new equipment produces a different distribution of too-light, acceptable and too-heavy batches compared to the old equipment.