The code on this page uses the numpy, scipy, pandas, statsmodels and matplotlib packages. These can be installed from the terminal with:
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install pandas
$ python3.11 -m pip install statsmodels
$ python3.11 -m pip install matplotlib
where python3.11 corresponds to the version of Python you have installed and are using.
Import these packages into Python with:
import numpy as np
from scipy import stats as st
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
For the statistical tests of normality, the hypotheses used are:
- H₀: the sample comes from a normally-distributed population
- H₁: the sample does not come from a normally-distributed population
We'll use a significance level of 5% for the entire page:
# Choose the p-value below which a result will be considered significant
alpha = 0.05
Set a seed for the random numbers:
# Set the seed so that we get the same random numbers each time this code runs
np.random.seed(20230811)
Create and plot the example data:
# Draw random numbers from the standard normal distribution
mu, sigma = 0, 1
data = np.random.normal(mu, sigma, 1000)
# Plot the random numbers
label = '1000 Random Numbers'
plt.hist(data, density=True, color='gray', label=label)
# Plot the exact distribution
x = np.linspace(-3, 3, 100)
y = st.norm.pdf(x, mu, sigma)
plt.plot(x, y, 'k--', label='Standard Normal Distribution')
# Format
plt.title('Example Data')
plt.ylabel('Probability Density')
plt.xlabel('Random Variables')
plt.ylim([0, 0.5])
plt.xlim([-3, 3])
plt.legend(frameon=False)
plt.show()
See the SciPy documentation for normaltest() and the Wikipedia page for the D'Agostino-Pearson test.
statistic, p_value = st.normaltest(data)
if p_value > alpha:
    result = 'Normal'
else:
    result = 'NOT Normal'
print(f"D'Agostino & Pearson: {result:>11s}")
print(statistic)
## D'Agostino & Pearson: Normal
## 4.612690205727473
See the SciPy documentation for jarque_bera() and the Wikipedia page for the Jarque-Bera test.
statistic, pvalue = st.jarque_bera(data)
if pvalue > alpha:
    result = 'Normal'
else:
    result = 'NOT Normal'
print(f'Jarque-Bera: {result:>20s}')
print(statistic)
## Jarque-Bera: Normal
## 3.880393602164786
Note that you cannot actually use the Kolmogorov-Smirnov test as a normality test; this function uses the Lilliefors test in the background (which is based on Kolmogorov-Smirnov).
See the statsmodels documentation for kstest_normal() (note that it is identical to the documentation for lilliefors()) and the Wikipedia page for the Kolmogorov-Smirnov test.
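The distinction matters because a genuine Kolmogorov-Smirnov test requires the distribution's parameters to be fully specified in advance, not estimated from the sample. As a sketch of the difference, using SciPy's kstest() (which is a true K-S test):

```python
import numpy as np
from scipy import stats as st

np.random.seed(20230811)
data = np.random.normal(0, 1, 1000)

# A genuine K-S test: mu and sigma are specified in advance (0 and 1 here)
statistic, p_value = st.kstest(data, 'norm', args=(0, 1))
print(f'{statistic:.4f}, {p_value:.4f}')

# Estimating mu and sigma from the same sample and then running a standard
# K-S test makes the p-value too conservative; the Lilliefors test corrects
# for this by using its own null distribution for the same statistic
stat_est, p_est = st.kstest(data, 'norm', args=(np.mean(data), np.std(data)))
print(f'{stat_est:.4f}')
```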
ksstat, pvalue = sm.stats.diagnostic.kstest_normal(data)
if pvalue > alpha:
    result = 'Normal'
else:
    result = 'NOT Normal'
print(f'Kolmogorov-Smirnov: {result:>13s}')
print(ksstat)
## Kolmogorov-Smirnov: Normal
## 0.0184928859912723
The Lilliefors test is a Kolmogorov-Smirnov test with estimated parameters.
See the statsmodels documentation for lilliefors() (note that it is identical to the documentation for kstest_normal()) and the Wikipedia page for the Lilliefors test.
ksstat, pvalue = sm.stats.diagnostic.lilliefors(data)
if pvalue > alpha:
    result = 'Normal'
else:
    result = 'NOT Normal'
print(f'Lilliefors: {result:>21s}')
print(ksstat)
## Lilliefors: Normal
## 0.0184928859912723
Note that the test statistic (ksstat) is the same as the one for the Kolmogorov–Smirnov test.
See the SciPy documentation for shapiro() and the Wikipedia page for the Shapiro-Wilk test.
statistic, pvalue = st.shapiro(data)
if pvalue > alpha:
    result = 'Normal'
else:
    result = 'NOT Normal'
print(f'Shapiro-Wilk: {result:>19s}')
print(statistic)
## Shapiro-Wilk: Normal
## 0.9976267218589783
This tests whether data comes from a particular distribution. See the SciPy documentation for anderson() and the Wikipedia page for the Anderson-Darling test.
result = st.anderson(data)
idx = np.where(result.significance_level == alpha * 100)
if result.statistic < result.critical_values[idx]:
    verdict = 'Normal'
else:
    verdict = 'NOT Normal'
print(f'Anderson–Darling: {verdict:>15s}')
print(result.statistic)
## Anderson–Darling: Normal
## 0.4273934579834986
A probability plot shows how closely two data sets agree. Types of probability plots include P-P plots (which compare cumulative probabilities) and Q-Q plots (which compare quantiles).
SciPy's probplot() function can produce a probability plot; see the SciPy documentation for probplot() and the Wikipedia disambiguation page for probability plots.
st.probplot(data, dist='norm', plot=plt)
plt.show()
plt.close()
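For reference, the values that probplot() plots can be approximated by hand. This sketch uses the common (i − 0.5)/n plotting-position convention (probplot itself uses Filliben's estimate, so its values differ very slightly):

```python
import numpy as np
from scipy import stats as st

np.random.seed(20230811)
data = np.sort(np.random.normal(0, 1, 1000))
n = len(data)

# Empirical cumulative probability assigned to each ordered data point
probs = (np.arange(1, n + 1) - 0.5) / n

# Q-Q plot: theoretical quantiles (x-axis) vs ordered sample values (y-axis)
theoretical_quantiles = st.norm.ppf(probs)

# P-P plot: theoretical CDF at the ordered data (x-axis) vs probs (y-axis)
theoretical_probs = st.norm.cdf(data)
```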
The easiest way to control the format of the plot is to take the objects that the function returns and plot those manually:
ax = plt.axes()
ax.set_title('Probability Plot')
ax.set_ylabel('Ordered Values')
ax.set_xlabel('Theoretical Quantiles')
ax.axline((0, 0), slope=1, c='gray', ls='--')
ax.plot(*st.probplot(data, dist='norm')[0], c='k', lw=3)
axis_lim = max([abs(min(data)), max(data)])
ax.set_xlim([-axis_lim, axis_lim])
ax.set_ylim([-axis_lim, axis_lim])
ax.set_aspect('equal')
ax.grid(ls='--')
plt.show()
plt.close()
A quantile-quantile plot shows the quantiles of two probability distributions. If the points lie close to or on the 45° line then it can be concluded that they are similar. This is a visual aid in testing for normality and should be checked in addition to the numerical test(s) because those do not tell you about behaviour at the tails.
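To see why the tails matter, compare standard normal quantiles with those of a heavy-tailed distribution such as Student's t with 5 degrees of freedom: the two nearly agree near the centre but diverge increasingly far out, which is exactly the pattern a Q-Q plot exposes:

```python
from scipy import stats as st

# Compare upper quantiles of the standard normal and a t-distribution (df=5)
for q in [0.75, 0.95, 0.99, 0.999]:
    print(f'{q}: normal = {st.norm.ppf(q):.3f}, t(5) = {st.t.ppf(q, df=5):.3f}')
```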
Statsmodels has the qqplot() function that will create a Q-Q plot; see the statsmodels documentation for qqplot() and the Wikipedia page for Q-Q plots.
sm.qqplot(data, line='45')
plt.show()
plt.close()
Controlling the format here is a bit tricky. The easiest way is probably to use the ProbPlot() class to create a probability plot object that can then be plotted with qqplot() and qqline() to get a Q-Q plot. Use the ax keyword argument to plot onto a Matplotlib Axes object:
ax = plt.axes()
ax.set_title('Q-Q Plot')
ax.set_ylabel('Sample Quantiles')
ax.set_xlabel('Theoretical Quantiles')
pp = sm.ProbPlot(data)
qq = pp.qqplot(marker='.', markerfacecolor='gray', markeredgecolor='k', alpha=0.3, ax=ax)
sm.qqline(qq.axes[0], line='45', color='gray', linestyle='--')
axis_lim = max([abs(min(data)), max(data)])
ax.set_xlim([-axis_lim, axis_lim])
ax.set_ylim([-axis_lim, axis_lim])
ax.set_aspect('equal')
ax.grid(ls='--')
plt.show()
plt.close()
The example data below is taken from an external source.
dct = {
'Region': [
'Ajuno', 'Angahuan', 'Arantepacua', 'Aranza', 'Charapan', 'Cheran',
'Cocucho', 'Comachuen', 'Corupo', 'Ihuatzio', 'Janitzio', 'Jaracuaro',
'Nahuatzen', 'Nurio', 'Paracho', 'Patzcuaro', 'Pichataro',
'Pomacuaran', 'Quinceo', 'Quiroga', 'San Felipe', 'San Lorenzo',
'Sevina', 'Tingambato', 'Turicuaro', 'Tzintzuntzan', 'Urapicho'
],
'Population Density': [
5.11, 5.15, 5.00, 4.13, 5.10, 5.22, 5.04, 5.25, 4.53, 5.74, 6.63, 5.73,
4.77, 6.06, 4.82, 4.98, 5.36, 4.96, 5.94, 5.01, 4.10, 4.69, 4.97, 5.01,
6.19, 4.67, 6.30
]
}
df = pd.DataFrame(dct)
print(df)
## Region Population Density
## 0 Ajuno 5.11
## 1 Angahuan 5.15
## 2 Arantepacua 5.00
## 3 Aranza 4.13
## 4 Charapan 5.10
## 5 Cheran 5.22
## 6 Cocucho 5.04
## 7 Comachuen 5.25
## 8 Corupo 4.53
## 9 Ihuatzio 5.74
## 10 Janitzio 6.63
## 11 Jaracuaro 5.73
## 12 Nahuatzen 4.77
## 13 Nurio 6.06
## 14 Paracho 4.82
## 15 Patzcuaro 4.98
## 16 Pichataro 5.36
## 17 Pomacuaran 4.96
## 18 Quinceo 5.94
## 19 Quiroga 5.01
## 20 San Felipe 4.10
## 21 San Lorenzo 4.69
## 22 Sevina 4.97
## 23 Tingambato 5.01
## 24 Turicuaro 6.19
## 25 Tzintzuntzan 4.67
## 26 Urapicho 6.30
# Plot the data
data = df['Population Density']
plt.hist(data, density=True, color='gray', label='Population Densities')
# Plot a normal distribution
x_bar = np.mean(data)
s = np.std(data)  # population standard deviation (ddof=0); use ddof=1 for the sample estimate
x = np.linspace(x_bar - 3 * s, x_bar + 3 * s, 100)
y = st.norm.pdf(x, x_bar, s)
plt.plot(x, y, 'k--', label='Normal Distribution')
# Axes
plt.xlim([x_bar - 3 * s, x_bar + 3 * s])
# Finish
plt.title('Population Densities in Mexican Regions')
plt.ylabel('Probability Density')
plt.xlabel('Binned Values')
plt.legend(frameon=False)
plt.show()
plt.close()
# D'Agostino's (and Pearson's) K² test
if st.normaltest(data)[1] > alpha:
    print("D'Agostino & Pearson: Normal")
else:
    print("D'Agostino & Pearson: NOT Normal")
# Jarque-Bera goodness-of-fit test
if st.jarque_bera(data)[1] > alpha:
    print('Jarque-Bera: Normal')
else:
    print('Jarque-Bera: NOT Normal')
# Lilliefors test (based on the Kolmogorov–Smirnov test)
if sm.stats.diagnostic.lilliefors(data)[1] > alpha:
    print('Lilliefors: Normal')
else:
    print('Lilliefors: NOT Normal')
# Shapiro–Wilk test
if st.shapiro(data)[1] > alpha:
    print('Shapiro-Wilk: Normal')
else:
    print('Shapiro-Wilk: NOT Normal')
# Anderson–Darling test
result = st.anderson(data)
# critical_values[2] corresponds to the 5% significance level
if result.statistic < result.critical_values[2]:
    print('Anderson–Darling: Normal')
else:
    print('Anderson–Darling: NOT Normal')
## D'Agostino & Pearson: Normal
## Jarque-Bera: Normal
## Lilliefors: NOT Normal
## Shapiro-Wilk: Normal
## Anderson–Darling: NOT Normal
The source performs seven statistical normality tests, two of which are the same (Lilliefors and Kolmogorov–Smirnov) and two of which are not implemented in Python (the W/S test, not to be confused with the Shapiro–Wilk test, and D'Agostino's D test, not to be confused with D'Agostino's K² test). The test statistics of the remaining four tests match those given by Python:
print(f'Jarque–Bera test: {st.jarque_bera(data)[0]:.3f}')
print(f'Lilliefors test: {sm.stats.diagnostic.lilliefors(data)[0]:.3f}')
print(f'Shapiro–Wilk test: {st.shapiro(data)[0]:.3f}')
print(f'Anderson–Darling test: {result.statistic:.3f}')
## Jarque–Bera test: 1.209
## Lilliefors test: 0.173
## Shapiro–Wilk test: 0.943
## Anderson–Darling test: 0.764
So our results match the source.
ax = plt.axes()
ax.set_title('Probability Plot')
ax.set_ylabel('Ordered Values')
ax.set_xlabel('Theoretical Quantiles')
(osm, osr), (slope, intercept, r) = st.probplot(data, dist='norm')
ax.plot(osm, osr, c='k', lw=3)
ax.axline((0, intercept), slope=slope, c='gray', ls='--')
ax.set_ylim([min(osr), max(osr)])
ax.set_xlim([min(osm), max(osm)])
ax.grid(ls='--')
plt.show()
plt.close()
ax = plt.axes()
ax.set_title('Q-Q Plot')
ax.set_ylabel('Sample Quantiles')
ax.set_xlabel('Theoretical Quantiles')
pp = sm.ProbPlot(data)
qq = pp.qqplot(markerfacecolor='gray', markeredgecolor='k', ax=ax)
x = pp.theoretical_quantiles
y = pp.sample_quantiles
sm.qqline(qq.axes[0], x=x, y=y, line='s', color='gray', linestyle='--')
ax.set_xlim([min(x), max(x)])
ax.grid(ls='--')
plt.show()
plt.close()