A chi-squared test is used for discrete, categorical data. For example: medical data that includes people who are sick and people who are healthy, demographic data comparing people who are male and female, assessment data where people either passed or failed, etc. These tests are not relevant for continuous data, ie where each measurement/observation in the dataset is a ‘value’ (eg a percentage, a height in cm, a weight in kg, etc).
Also note that for a chi-squared test to be valid there must be more than 5 samples of each type.
There are three main chi-squared tests:
In R, all three are performed using the same function (chisq.test()
) from the base programme (ie you don’t need to install any packages). For more info about this function, see its documentation.
This test answers the question does my distribution of numbers come from the distribution I expect?
The following example comes from Wikipedia. If you randomly picked 100 people out of a population and 44 were male, is that population 50% male? You would have observed 44 males and 56 females, but your expectation would have been 50 males and 50 females. There are more than 5 samples in each group and the data is categorical (in this case it is binary - either male or female), as opposed to each data point being a measurement, so a chi-squared test can be considered. Take a look at how the information is entered into the chisquare()
function in order to test the goodness-of-fit of your sample to the expected distribution:
result <- chisq.test(c(44, 56), p = c(1 / 2, 1 / 2))
sprintf("p = %.2f", round(result$p.value, 2))
## [1] "p = 0.23"
As you can see, the chisq.test()
statement takes the numbers that were observed in a vector as the first argument and the proportions that were expected in a vector as the “p” keyword argument. The p-value of 0.23 suggests that there is a 23% chance that the population is 50% male/female.
As it happens, if you omit the “p” keyword argument the chisq.test()
function assumes you expect equal numbers of each type of observation. Therefore, in this example (because we expect equal numbers of males and females) we can actually leave the argument out:
result <- chisq.test(c(44, 56))
sprintf("p = %.2f", round(result$p.value, 2))
## [1] "p = 0.23"
If we observed 50 males and 50 females, the test would tell use that there is a 100% chance that the population is 50% male/female:
result <- chisq.test(c(50, 50))
sprintf("p = %.2f", round(result$p.value, 2))
## [1] "p = 1.00"
Note that the test does NOT work properly if you give it your data as proportions or percentages:
result <- chisq.test(c(0.44, 0.56))
sprintf("p = %.2f", round(result$p.value, 2))
## [1] "p = 0.90"
You must always provide the actual numbers observed and the proportions expected.
The following example comes from the Crash Course Youtube video:
Healer | Tank | Assassin | Fighter | |
---|---|---|---|---|
Observed | 25 | 35 | 50 | 90 |
Expected | 30 | 40 | 40 | 90 |
Again, the data is categorical and each ‘bin’ (cell in the table) has more than 5 samples so a chi-squared test is appropriate.
n <- 30 + 40 + 40 + 90
result <- chisq.test(c(25, 35, 50, 90), p = c(30 / n, 40 / n, 40 / n, 90 / n))
sprintf("p = %.1f", round(result$p.value, 1))
## [1] "p = 0.3"
Note that this example has 3 degrees of freedom. In general, the formula for degrees of freedom is (nrows - 1) * (ncols - 1)
:
nrows <- 2
ncols <- 4
df <- (nrows - 1) * (ncols - 1)
sprintf("Degrees of freedom = %d", df)
## [1] "Degrees of freedom = 3"
This example can also be done manually by calculating the chi-square test statistic by hand:
\(\chi^2 = \sum\frac{\left(observed - expected\right)^2}{expected}\)
chisq <- (25 - 30)^2 / 30 + (35 - 40)^2 / 40 + (50 - 40)^2 / 40 +
(90 - 90)^2 / 90
sprintf("chi-square = %.2f", round(chisq, 2))
## [1] "chi-square = 3.96"
Then the chi-square test statistic can be converted into a p-value using the chi-distribution for 3 degrees of freedom:
p <- pchisq(chisq, df = 3, lower.tail = FALSE)
sprintf("p = %.1f", round(p, 1))
## [1] "p = 0.3"
As the p-value is above 0.05 we fail to reject the null hypothesis that the distribution of characters chosen by players is as the developers claimed.
The chi-squared independence test answers the question do these two (or more) samples come from the same distribution?
Again using an example from the Crash Course Youtube video:
Observed:
Gryffindor | Hufflepuff | Ravenclaw | Slytherin | |
---|---|---|---|---|
No | 79 | 122 | 204 | 74 |
Yes | 82 | 139 | 240 | 69 |
Expected:
Gryffindor | Hufflepuff | Ravenclaw | Slytherin | |
---|---|---|---|---|
No | 77.12 | 120.71 | 212.68 | 68.5 |
Yes | 83.88 | 131.29 | 231.32 | 74.5 |
In R, this is calculated in the same way as the previous example except a data frame is used to represent the ‘observed’ table:
df <- data.frame(
No = c(79, 122, 204, 74),
Yes = c(82, 130, 240, 69)
)
result <- chisq.test(df)
sprintf("p = %.1f", round(result$p.value, 1))
## [1] "p = 0.6"
As p is larger than 0.05 we fail to reject the null hypothesis that the distribution of Hogwarts houses is the same for both pineapple-on-pizza lovers and haters.
Using an example from Stat Trek:
Observed:
Rep | Dem | Ind | |
---|---|---|---|
Male | 200 | 150 | 50 |
Female | 250 | 300 | 50 |
df <- data.frame(
Male = c(200, 150, 50),
Female = c(250, 300, 50)
)
result <- chisq.test(df)
sprintf("p = %.4f", round(result$p.value, 4))
## [1] "p = 0.0003"
To quote: “since the p-value (0.0003) is less than the significance level (0.05), we cannot accept the null hypothesis. Thus, we conclude that there is a relationship between gender and voting preference”.
If we want to perform the independence test multiple times (eg for multiple subsets of your dataset) then it might be more efficient to write a function to do this.
As an example, we will use the same data from section 2.2 above, although we will first transpose it (swap the columns and rows, effectively ‘rotating’ the data frame):
df <- data.frame(
Rep = c(200, 250),
Dem = c(150, 300),
Ind = c(50, 50)
)
result <- chisq.test(df)
sprintf("p = %.4f", round(result$p.value, 4))
## [1] "p = 0.0003"
As you can see, transposing the data frame has no effect on the calculation (the p-value is still 0.0003029775, or 0.0003 when rounded). However, in this format it is easier to compare pairs of columns. This allows use to look for relationships between gender and voting preferences for pairs of parties, as opposed to all three parties together:
# Republicans and Democrats only
result <- chisq.test(df[c("Rep", "Dem")])
sprintf("p = %.4f", round(result$p.value, 4))
# Republicans and Independents only
result <- chisq.test(df[c("Rep", "Ind")])
sprintf("p = %.4f", round(result$p.value, 4))
# Democrats and Independents only
result <- chisq.test(df[c("Dem", "Ind")])
sprintf("p = %.4f", round(result$p.value, 4))
## [1] "p = 0.0008"
## [1] "p = 0.3691"
## [1] "p = 0.0025"
Now we can write a function to do the above combinations of comparisons automatically:
chisq.test_twocolumns <- function(df) {
colnames <- colnames(df)
combinations <- utils::combn(colnames, 2)
for (i in seq_len(dim(combinations)[2])) {
name1 <- combinations[1, i]
name2 <- combinations[2, i]
result <- chisq.test(df[c(name1, name2)])
cat(sprintf("p = %.4f\n", round(result$p.value, 4)))
}
}
chisq.test_twocolumns(df)
## p = 0.0008
## p = 0.3691
## p = 0.0025
The chi-squared homogeneity test answers the question do these two (or more) samples come from the same population?
The homogeneity test is that same as the independence test in practice (ie the way it is calculated in R is the same) but it is different conceptually:
The difference between a chi-squared independence test and a chi-squared homogeneity test is that, in the former, there is one group of people (ie one sample of the population) surveyed whereas in the second there are multiple groups surveyed.
As a result, the question you are asking about the equivalency of the sample distribution and the population distribution is different.
Using an example from DisplayR:
Observed:
Living alone | Living with others | |
---|---|---|
On a diet | 2 | 25 |
Watch what I eat and drink | 23 | 146 |
Whatever I feel like | 7 | 124 |
df <- data.frame(
Living_alone = c(25, 146, 124),
Living_with_others = c(2, 23, 7)
)
result <- chisq.test(df)
sprintf("p = %.3f", round(result$p.value, 3))
## [1] "p = 0.052"
To quote: “as this is greater than 0.05, by convention the conclusion is that the difference is due to sampling error, although the closeness of 0.05 to 0.052 makes this very much a ‘line ball’ conclusion.”
Consider a new example:
A company manufactures widgets. The mass of the widgets that are produced is controlled through batch testing: rather than weighing every single one they take a batch and weigh all of them instead. If the batch as a whole is too light or too heavy, then all of the widgets in that batch are sent for re-manufacture. The company’s database shows that 90% of all batches are made within the acceptable mass range, while 5% are too heavy and 5% are too light.
If the company were to install new manufacturing equipment they would want to know if they could expect this same distribution of acceptance and rejection. If they tested the first 1000 batches after installing the new equipment and 30 were too light and 60 were to heavy, would this be the case?
The problem above suggests that the observed numbers of too light, acceptable and too heavy batches was [30, 910, 60]
whereas the expected numbers were [50, 900, 50]
, respectively.
The null and alternative hypotheses are as follows:
perform_chi_squared <- function(obs, exp) {
n <- sum(obs)
result <- chisq.test(obs, p = exp / n)
p <- result$p.value
if (p <= 0.001) {
significance <- "***"
interpretation <- "H_1 is accepted at a 99.9% confidence level."
} else if (p <= 0.01) {
significance <- "**"
interpretation <- "H_1 is accepted at a 99% confidence level."
} else if (p <= 0.05) {
significance <- "*"
interpretation <- "H_1 is accepted at a 95% confidence level."
} else {
significance <- ""
interpretation <- "H_0 is accepted"
}
return <- list(p, significance, interpretation)
return(return)
}
observations <- c(30, 910, 60)
expected <- c(50, 900, 50)
result <- perform_chi_squared(observations, expected)
p <- result[1]
sig <- result[2]
int <- result[3]
sprintf("p = %.3f%s. %s", p, sig, int)
## [1] "p = 0.006**. H_1 is accepted at a 99% confidence level."
Thus, at a 99% confidence level, we can say that the new equipment has a different distribution of overweight-underweight widget production compared to the old equipment.