Analysis of variance (ANOVA) is a category of hypothesis test that can be considered when comparing more than two groups of numerical data.

1 Python Packages

The code on this page uses the scikit-learn, Matplotlib, Seaborn and SciPy packages. These can be installed from the terminal with the following commands:

# Replace `python3.11` with the version of Python you have installed and are using
$ python3.11 -m pip install scikit-learn
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install seaborn
$ python3.11 -m pip install scipy

Once finished, import these packages into your Python script as follows:

from sklearn import datasets
from matplotlib import pyplot as plt
from matplotlib import lines
import seaborn as sns
from scipy import stats as st

2 Example Data

As an example, let’s use the wine recognition dataset, a toy dataset from scikit-learn.

  • For this dataset’s description, see here
  • For this dataset’s documentation, see here
  • For an example, see here

This dataset can be loaded with:

# Load the dataset
wine = datasets.load_wine(as_frame=True)
# Extract the feature and target data together
df = wine['frame']

This contains the results of chemical analyses of 178 different wines grown in the same region in Italy by three different cultivators. For this example, let’s just look at the total phenols that were measured in each cultivator’s wines (phenolic content in wine is an important factor in its quality):

# Trim the data
cols = ['total_phenols', 'target']
df = df[cols]
# Rename the target
df = df.rename(columns={'target': 'cultivator'})

print(df.head())
##    total_phenols  cultivator
## 0           2.80           0
## 1           2.65           0
## 2           2.80           0
## 3           3.85           0
## 4           2.80           0

Only 5 rows are shown above (all from cultivator number 0) but there are 178 in total (from cultivators 0, 1 and 2):

# Number of rows
print(df.shape[0])
## 178
# Cultivators' names
print(df['cultivator'].unique())
## [0 1 2]

The best way to understand the raw data is to see all of it. Use a strip plot (a form of scatter plot for categorical data) to do this, with a box plot underneath to show each group’s distribution:

# Plot
ax = plt.axes()
sns.boxplot(
   df, x='cultivator', y='total_phenols', color='lightgrey', whis=[0, 100],
   showmeans=True, meanline=True, meanprops={'color': 'black'}
)
sns.stripplot(
   df, x='cultivator', y='total_phenols',
   color='lightgrey', edgecolor='black', linewidth=1
)
title = """The Phenolic Content in 178 Wines
from 3 Different Italian Cultivators"""
ax.set_title(title)
ax.set_ylabel('Total Phenols')
ax.set_xlabel('')
ax.set_xticklabels(['Cultivator 0', 'Cultivator 1', 'Cultivator 2'])
handles = [lines.Line2D([0], [0], color='k', linewidth=1, linestyle='--')]
ax.legend(handles, ['Group Means'], loc='upper right')
plt.show()

3 Choose a Statistical Test

3.1 Set a Research Question

Looking at the plotted data above raises a question: is the mean phenolic content different between the three cultivators? If it is, it would imply that the quality of the three cultivators’ wines is different (assuming, of course, that the quality of a wine is objective)!

3.2 Parametric vs Non-Parametric

First, we need to decide if we should use parametric or non-parametric statistics to address this research question. Parametric statistics are based on assumptions about the data’s distribution while non-parametric statistics are not. You should choose parametric statistics if:

  • Your data is normally distributed (a quick way to check this directly is sketched after this list), or
  • Your data is not normally distributed but the sample size is large. Some rules-of-thumb as to what can be considered ‘large’ are:
    • \(n > 20\) if you only have one group
    • \(n > 15\) in each group if you have 2 to 9 groups
    • \(n > 20\) in each group if you have 10 to 12 groups
  • The data should not be skewed. In other words, the mean should be representative of the central tendency and should not be much different from the median.
    • When doing parametric statistics, use the mean; when doing non-parametric statistics, use the median
  • There should not be an influential number of outliers. While this is similar to the previous point about the data not being skewed, bear in mind that it’s possible to have outliers both above and below the mean of the data, essentially cancelling out the skewness. This is why a ‘lack of skewness’ and a ‘lack of outliers’ both need to be checked for.
  • Parametric tests usually have more statistical power than non-parametric tests and so are more likely to detect a significant difference when one truly exists. If there is a need to maximise the power of a test, this is a reason to prefer a parametric test over a non-parametric one.
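
As an aside, the normality condition in the first bullet point can be checked directly rather than relying on sample size. This isn’t part of the workflow used on this page, but a minimal sketch using the Shapiro-Wilk test from SciPy (applied per group to the data loaded above) might look like this:

# Shapiro-Wilk test for normality, applied to each group separately
for cultivator, group in df.groupby('cultivator'):
    statistic, p_value = st.shapiro(group['total_phenols'])
    print(f'Cultivator {cultivator}: W = {statistic:.3f}, p = {p_value:.3f}')

A large p-value here would mean there is no evidence against normality for that group.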

Let’s check if we have sufficiently large groups:

# Sample sizes
n = df['cultivator'].value_counts()

print(n)
## cultivator
## 1    71
## 0    59
## 2    48
## Name: count, dtype: int64

We have three groups and \(n > 15\) in each. Taking a look at the box plot, we can see that the groups aren’t considerably skewed (the means are close to the medians) and that there are not really any influential outliers. Hence, a parametric test is appropriate for assessing this research question.
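
If you want to back up that visual assessment with numbers, a minimal sketch comparing each group’s mean and median and counting points beyond 1.5 times the interquartile range (a common rule of thumb for outliers, not something prescribed above) could look like this:

# Compare each group's mean and median, and count 1.5 * IQR outliers
for cultivator, group in df.groupby('cultivator'):
    values = group['total_phenols']
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    n_outliers = ((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)).sum()
    print(
        f'Cultivator {cultivator}: mean = {values.mean():.2f}, '
        f'median = {values.median():.2f}, outliers = {n_outliers}'
    )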

3.3 Homogeneity of Variance

The spread of the data should be roughly equal for all groups. Despite this assumption being called homogeneity of variance, you can actually look at either the variance or the standard deviation to assess it. This is because the ratio of two variances is - by definition - the square of the ratio of the corresponding standard deviations, and squaring is a monotonic function for positive numbers, so the ordering is preserved.

With that said, here are some rules-of-thumb as to what standard deviations can be considered ‘roughly equal’:

  • \(0.5 < \frac{s_0}{s_1} < 2\) (source: Wikipedia)
  • \(0.9 < \frac{s_0}{s_1} < 1.1\) (source: Codecademy)

where \(s_0\), \(s_1\) are the sample standard deviations for the two groups. Let’s check our data:

# Homogeneity of variance (using sample standard deviations)
s = df.groupby('cultivator')['total_phenols'].std(ddof=1)
# Compare each pair of groups: (0, 1), (0, 2) and (1, 2)
for i in range(3):
    for j in range(i + 1, 3):
        ratio = s[i] / s[j]
        print(f'Ratio of standard deviations ({i} vs {j}): {ratio:.2f}')
## Ratio of standard deviations (0 vs 1): 0.62
## Ratio of standard deviations (0 vs 2): 0.95
## Ratio of standard deviations (1 vs 2): 1.53

All ratios are between 0.5 and 2 so the standard deviations are ‘equal enough’.
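
A formal test of this assumption also exists: Levene’s test, available in SciPy as st.levene(). It isn’t needed for the rule-of-thumb approach used here, but a minimal sketch would be:

# Levene's test for homogeneity of variance (null hypothesis: equal variances)
groups = [group['total_phenols'] for _, group in df.groupby('cultivator')]
statistic, p_value = st.levene(*groups)
print(f"Levene's test: W = {statistic:.2f}, p = {p_value:.3f}")

A large p-value would indicate no evidence that the group variances differ.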

3.4 Use the Flowchart

We now have enough insight into our research question (our ‘draft hypothesis’) and our data to follow a statistical test selection process:

  • Our dependent variable - total phenols - is continuous because it is made up of measurements that are numbers as opposed to discrete categories or ranks
  • Our independent variable - which cultivator produced the wine - is categorical (specifically, it is nominal)
  • We have three groups for the independent variable because there are three cultivators
  • As discussed above, we want to do a parametric test

Following a statistical test selection flowchart with these answers tells us that we should be using ANOVA to test our hypothesis.

3.5 Formalise the Hypothesis

In order to answer the research question (“is the mean phenolic content different between the three cultivators?”) we now need to formulate it properly as a null hypothesis with a corresponding alternative hypothesis:

  • \(H_0:\) the true mean phenolic content levels are the same for all three cultivators
  • \(H_1:\) the true mean phenolic content levels are not the same for all three cultivators

More mathematically:

  • \(H_0: \mu_0 = \mu_1 = \mu_2\)
  • \(H_1: \mu_0 \neq \mu_1 \lor \mu_0 \neq \mu_2 \lor \mu_1 \neq \mu_2\)

where \(\mu_0\), \(\mu_1\), \(\mu_2\) are the true means (of all wines in the population, not just those in the samples) for the three groups and \(\lor\) is the logical “or”. Notice that this is a two-tailed test: we’re interested in finding any difference between means if one exists, not just a specific difference (\(\mu_0 > \mu_1\) or \(\mu_0 < \mu_1\), or similar for the other combinations of groups).

4 One-Way ANOVA Using an F-Test

Now that we’ve chosen a hypothesis test, the act of actually performing it is incredibly simple. First, split the data up into one series per group:

# Samples
sample_0 = df[df['cultivator'] == 0]['total_phenols']
sample_1 = df[df['cultivator'] == 1]['total_phenols']
sample_2 = df[df['cultivator'] == 2]['total_phenols']

Then use the one-way ANOVA F-test from SciPy:

# One-way ANOVA
f_statistic, p_value = st.f_oneway(sample_0, sample_1, sample_2)

print(f'One-way ANOVA: F = {f_statistic:.2f}, p = {p_value:.2e}')
## One-way ANOVA: F = 93.73, p = 2.14e-28

The “F-test” is the actual statistical test being performed in the background during an ANOVA: it compares the variability between the group means to the variability within the groups.
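
To make that a bit more concrete, here is a sketch of how the F-statistic could be reconstructed by hand from the between-group and within-group sums of squares. This is purely for illustration; it is the same calculation that st.f_oneway() performs:

# Reconstruct the one-way ANOVA F-statistic by hand
samples = [sample_0, sample_1, sample_2]
k = len(samples)                                  # number of groups
n_total = sum(len(sample) for sample in samples)  # total number of observations
grand_mean = df['total_phenols'].mean()
# Between-group sum of squares: variability of the group means around the grand mean
ss_between = sum(len(sample) * (sample.mean() - grand_mean)**2 for sample in samples)
# Within-group sum of squares: variability of the observations around their group means
ss_within = sum(((sample - sample.mean())**2).sum() for sample in samples)
# Mean squares and the F-statistic
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
# p-value from the F-distribution's survival function
p_manual = st.f.sf(f_manual, k - 1, n_total - k)
print(f'Manual one-way ANOVA: F = {f_manual:.2f}, p = {p_manual:.2e}')

The values printed here should match the F-statistic and p-value from st.f_oneway() above.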

4.1 Interpreting the Result

This p-value is incredibly small, which means we can reject \(H_0\) and conclude that the true mean phenolic content levels are not the same for all three cultivators. Often, you will see this reported using asterisks to indicate the significance level (α) associated with the result. Additionally, if the p-value is very small (like it is here), it’s usually just reported as “<0.001” rather than as an exact value. Here are functions to add this formatting:

def get_significance(p):
    """Returns the significance of a p-values as a string of stars."""
    if p <= 0.001:
        return '***'
    elif p <= 0.01:
        return '**'
    elif p <= 0.05:
        return '*'
    elif p <= 0.1:
        return '.'
    else:
        return ''


def round_p_value(p):
    """Round a small p-value so that it is human-readable."""
    if p < 0.001:
        return '<0.001'
    else:
        return f'{p:5.3}'


p_rounded = round_p_value(p_value)
significance = get_significance(p_value)

print(f'The p-value is {p_rounded} ({significance})')
## The p-value is <0.001 (***)

5 ANOVA vs the Two-Sample t-Test

Another way to address the research question would have been to look at the differences between each pair of groups in turn. In other words, we could have performed three two-sample t-tests: cultivator 0 against cultivator 1, cultivator 0 against cultivator 2 and cultivator 1 against cultivator 2. That is not fundamentally an incorrect approach, but performing multiple uncorrected tests inflates the chance of a false positive and can inadvertently lead to p-hacking, so it should be avoided. It would also be less efficient than performing one ANOVA, especially if we had had more than three groups (with 4 groups, 6 t-tests are required; with 5 groups, 10 tests; with 6 groups, 15 tests; and so on).
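
To see why doing lots of uncorrected pairwise tests is risky, here is a rough sketch of how the chance of at least one false positive grows with the number of t-tests (assuming, for simplicity, that the tests are independent and each uses a significance level of 0.05):

from math import comb

# Family-wise chance of at least one false positive with uncorrected t-tests
alpha = 0.05
for n_groups in range(3, 7):
    n_tests = comb(n_groups, 2)  # number of pairwise comparisons
    fwer = 1 - (1 - alpha)**n_tests
    print(f'{n_groups} groups: {n_tests} t-tests, chance of a false positive = {fwer:.0%}')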

However, the ANOVA only answers the question of whether the means of the groups are the same. It won’t tell us which of the groups are different (or how many are different, if there are lots) or whether they are higher or lower than the others. If you get a significant result from an ANOVA you need to analyse the data further - e.g. by performing a Tukey’s range test, which is demonstrated over here.
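
For reference, SciPy (version 1.8 and later) provides Tukey’s range test as st.tukey_hsd(). A minimal sketch using the samples created earlier might look like this:

# Tukey's range test: post hoc pairwise comparisons after a significant ANOVA
result = st.tukey_hsd(sample_0, sample_1, sample_2)
print(result)

Printing the result shows the estimated difference in means, the confidence interval and the p-value for each pair of cultivators.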
