Analysis of variance (ANOVA) is a family of hypothesis tests that can be used when comparing the means of more than two groups of numerical data.
The code on this page uses the scikit-learn, Matplotlib, Seaborn and SciPy packages. These can be installed from the terminal with the following commands:
# `python3.11` corresponds to the version of Python you have installed and are using
$ python3.11 -m pip install scikit-learn
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install seaborn
$ python3.11 -m pip install scipy
Once finished, import these packages into your Python script as follows:
from sklearn import datasets
from matplotlib import pyplot as plt
from matplotlib import lines
import seaborn as sns
from scipy import stats as st
As an example, let’s use the wine recognition dataset, a toy dataset from scikit-learn.
This dataset can be loaded with:
# Load the dataset
wine = datasets.load_wine(as_frame=True)
# Extract the feature and target data together
df = wine['frame']
This contains the results of chemical analyses of 178 wines grown in the same region of Italy by three different cultivators. For this example, let's just look at the total phenols that were measured in each cultivator's wines (phenolic content is an important factor in a wine's quality):
# Trim the data
cols = ['total_phenols', 'target']
df = df[cols]
# Rename the target
df = df.rename(columns={'target': 'cultivator'})
print(df.head())
## total_phenols cultivator
## 0 2.80 0
## 1 2.65 0
## 2 2.80 0
## 3 3.85 0
## 4 2.80 0
Only 5 rows are shown above (all from cultivator number 0) but there are 178 in total (from cultivators 0, 1 and 2):
# Number of rows
print(df.shape[0])
## 178
# Cultivators' names
print(df['cultivator'].unique())
## [0 1 2]
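Before plotting, it can help to get a quick numeric summary of each group. A sketch using pandas' `groupby` (the `summary` name is just for illustration):

```python
from sklearn import datasets

# Re-create the trimmed data frame from the steps above
wine = datasets.load_wine(as_frame=True)
df = wine['frame'][['total_phenols', 'target']]
df = df.rename(columns={'target': 'cultivator'})

# Per-cultivator summary statistics
summary = df.groupby('cultivator')['total_phenols'].describe()
print(summary[['count', 'mean', 'std']])
```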
The best way to understand the raw data is to see all of it. Use a strip plot (a categorical scatter plot) to do this, with a box plot underneath to show each group's distribution:
# Plot
ax = plt.axes()
sns.boxplot(
    df, x='cultivator', y='total_phenols', color='lightgrey', whis=[0, 100],
    showmeans=True, meanline=True, meanprops={'color': 'black'}
)
sns.stripplot(
    df, x='cultivator', y='total_phenols',
    color='lightgrey', edgecolor='black', linewidth=1
)
title = """The Phenolic Content in 178 Wines
from 3 Different Italian Cultivators"""
ax.set_title(title)
ax.set_ylabel('Total Phenols')
ax.set_xlabel('')
ax.set_xticklabels(['Cultivator 0', 'Cultivator 1', 'Cultivator 2'])
handles = [lines.Line2D([0], [0], color='k', linewidth=1, linestyle='--')]
ax.legend(handles, ['Group Means'], loc='upper right')
plt.show()
Looking at the plotted data above raises a question: is the mean phenolic content different between the three cultivators? If it is, it would imply that the quality of the three winemakers is different (assuming, of course, that quality of wine is objective)!
First, we need to decide whether we should use parametric or non-parametric statistics to address this research question. Parametric statistics are based on assumptions about the data's distribution while non-parametric statistics are not. You should choose parametric statistics if:
- each group's sample size is sufficiently large (as a rule of thumb, \(n > 15\) per group)
- the data in each group is not considerably skewed
- there are no influential outliers
Let’s check if we have sufficiently large groups:
# Sample sizes
n = df['cultivator'].value_counts()
print(n)
## cultivator
## 1 71
## 0 59
## 2 48
## Name: count, dtype: int64
We have three groups and \(n > 15\) in each. Taking a look at the box plot we can see that the groups aren’t considerably skewed (the means are close to the medians) and that there are not really any influential outliers. Hence, a parametric test is appropriate for assessing this research question.
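The visual check for skewness can also be done numerically by comparing each group's mean to its median; if they are close, the group is not heavily skewed. A sketch (the `check` name is just for illustration):

```python
from sklearn import datasets

# Re-create the trimmed data frame from above
wine = datasets.load_wine(as_frame=True)
df = wine['frame'][['total_phenols', 'target']]
df = df.rename(columns={'target': 'cultivator'})

# Means close to medians suggest the groups are not heavily skewed
grouped = df.groupby('cultivator')['total_phenols']
check = grouped.agg(['mean', 'median'])
check['difference'] = check['mean'] - check['median']
print(check)
```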
The spread of the data should be roughly equal for all groups. Despite being called homogeneity of variance you can actually look at either variance or standard deviation to assess this. This is because the ratio of variances is - by definition - the square of the ratio of standard deviations and the square is a monotonic function, so order is preserved (for positive numbers).
With that said, here is a common rule-of-thumb as to what standard deviations can be considered 'roughly equal':

\[0.5 < \frac{s_0}{s_1} < 2\]

where \(s_0\), \(s_1\) are the sample standard deviations for the two groups being compared. Let's check our data:
# Homogeneity of variance (using sample standard deviations)
s = df.groupby('cultivator')['total_phenols'].std(ddof=1)
for i in range(3):
    for j in range(i + 1, 3):
        ratio = s[i] / s[j]
        print(f'Ratio of standard deviations: {ratio:.2f}')
## Ratio of standard deviations: 0.62
## Ratio of standard deviations: 0.95
## Ratio of standard deviations: 1.53
All ratios are between 0.5 and 2 so the standard deviations are ‘equal enough’.
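If you'd like a formal check to go with this rule-of-thumb, SciPy provides Levene's test for homogeneity of variance (a small p-value would suggest unequal variances; note that ANOVA is reasonably robust to moderate violations when group sizes are similar). A minimal sketch:

```python
from sklearn import datasets
from scipy import stats as st

# Re-create the trimmed data frame from above
wine = datasets.load_wine(as_frame=True)
df = wine['frame'][['total_phenols', 'target']]
df = df.rename(columns={'target': 'cultivator'})

# Levene's test for homogeneity of variance across the three groups
groups = [group['total_phenols'] for _, group in df.groupby('cultivator')]
stat, p = st.levene(*groups)
print(f"Levene's test: W = {stat:.2f}, p = {p:.3f}")
```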
We have enough insight into our research question (our ‘draft hypothesis’) and our data to be able to now follow a statistical test selection process:
Looking at the flowchart below tells us that we should thus be using ANOVA to test our hypothesis:
In order to answer the research question ("is the mean phenolic content different between the three cultivators?") we now need to formulate it properly as a null hypothesis with a corresponding alternative hypothesis:
- \(H_0\): the mean phenolic content is the same for all three cultivators
- \(H_1\): the mean phenolic content differs between at least two of the cultivators
More mathematically:

\[H_0: \mu_0 = \mu_1 = \mu_2\]
\[H_1: \mu_0 \neq \mu_1 \lor \mu_0 \neq \mu_2 \lor \mu_1 \neq \mu_2\]

where \(\mu_0\), \(\mu_1\), \(\mu_2\) are the true means (of all wines in the population, not just those in the samples) for the three groups and \(\lor\) is the logical "or". Notice that this is a two-tailed test: we're interested in finding any difference between means if one exists, not just a specific difference (\(\mu_0 > \mu_1\) or \(\mu_0 < \mu_1\), or similar for the other combinations of groups).
Now that we’ve chosen a hypothesis test the act of actually performing it is incredibly simple. Firstly, split the data up into one series of data per group:
# Samples
sample_0 = df[df['cultivator'] == 0]['total_phenols']
sample_1 = df[df['cultivator'] == 1]['total_phenols']
sample_2 = df[df['cultivator'] == 2]['total_phenols']
Then use the one-way ANOVA F-test from SciPy:
# One-way ANOVA
f_statistic, p_value = st.f_oneway(sample_0, sample_1, sample_2)
print(f'One-way ANOVA: F = {f_statistic:.2f}, p = {p_value:.2e}')
## One-way ANOVA: F = 93.73, p = 2.14e-28
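The F-statistic is the ratio of the between-group mean square to the within-group mean square. As a sketch of what `f_oneway` calculates behind the scenes, here it is by hand with NumPy (which scikit-learn already depends on):

```python
import numpy as np
from sklearn import datasets

# Re-create the samples from above
wine = datasets.load_wine(as_frame=True)
df = wine['frame'][['total_phenols', 'target']]
df = df.rename(columns={'target': 'cultivator'})
samples = [df[df['cultivator'] == i]['total_phenols'].values for i in range(3)]

# Grand mean, number of groups and total sample size
grand_mean = np.concatenate(samples).mean()
k = len(samples)
n_total = sum(len(s) for s in samples)

# Between-group and within-group sums of squares
ss_between = sum(len(s) * (s.mean() - grand_mean)**2 for s in samples)
ss_within = sum(((s - s.mean())**2).sum() for s in samples)

# Mean squares and the F-statistic
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
print(f'F = {f_manual:.2f}')
```

This reproduces the F-statistic reported by `st.f_oneway` above.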
The “F-test” is the actual test being performed in the background during the ANOVA.
This p-value is incredibly small, which means we can reject \(H_0\) and conclude that the true mean phenolic content levels are not the same for all three cultivators. Often, you will see this reported using asterisks to indicate the significance level (α) associated with the result. Additionally, if the p-value is very small (like it is here) it’s usually just reported as “<0.001” rather than as an exact value. Here are functions to add in this formatting:
def get_significance(p):
    """Return the significance of a p-value as a string of stars."""
    if p <= 0.001:
        return '***'
    elif p <= 0.01:
        return '**'
    elif p <= 0.05:
        return '*'
    elif p <= 0.1:
        return '.'
    else:
        return ''
def round_p_value(p):
    """Round a small p-value so that it is human-readable."""
    if p < 0.001:
        return '<0.001'
    else:
        return f'{p:5.3}'
p_rounded = round_p_value(p_value)
significance = get_significance(p_value)
print(f'The p-value is {p_rounded} ({significance})')
## The p-value is <0.001 (***)
Another way to address the research question would have been to look at the differences between each pair of groups in turn. In other words, we could have performed three two-sample t-tests: cultivator 0 against cultivator 1, cultivator 0 against cultivator 2 and cultivator 1 against cultivator 2. That is not fundamentally an incorrect approach, but each additional test adds another chance of a false positive (the multiple comparisons problem), which can inadvertently lead to p-hacking or false findings, so it should be avoided. It would also be less efficient than performing one ANOVA, especially with more than three groups (with 4 groups, 6 t-tests are required; with 5 groups, 10 tests; with 6 groups, 15 tests; and so on).
However, the ANOVA only answers the question of whether the means of the groups are all the same. It won't tell us which of the groups differ (or how many differ, if there are several) or whether their means are higher or lower than the others'. If you get a significant result from an ANOVA you need to analyse the data further - e.g. by performing a Tukey's range test, which is demonstrated over here.
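As a minimal sketch of that follow-up analysis (assuming a recent version of SciPy that provides `st.tukey_hsd`), the pairwise comparisons could look like:

```python
from sklearn import datasets
from scipy import stats as st

# Re-create the samples from above
wine = datasets.load_wine(as_frame=True)
df = wine['frame'][['total_phenols', 'target']]
df = df.rename(columns={'target': 'cultivator'})
samples = [df[df['cultivator'] == i]['total_phenols'] for i in range(3)]

# Tukey's range test: pairwise comparisons with family-wise error control
res = st.tukey_hsd(*samples)
print(res)
```

The result object contains a matrix of pairwise p-values (`res.pvalue`), one for each combination of groups.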