Quadratic regression fits a parabolic curve of best fit to continuous numerical data that has one dependent variable (making it a 'univariate' model), one independent variable and one group. Note that it still counts as a linear model, despite the curved fit, because it is linear in its parameters. If these criteria are met by your data then quadratic regression is an appropriate choice.
Python Packages:
The code on this page uses the scikit-learn, matplotlib, numpy, scipy, pandas and statsmodels packages. These can be installed from the terminal with:
$ python3.11 -m pip install scikit-learn
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install pandas
$ python3.11 -m pip install statsmodels
where python3.11 corresponds to the version of Python you have installed and are using.
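To quickly check that everything installed correctly, one option is to try importing all six packages in one go (note that, although the package is distributed as scikit-learn, it is imported under the name sklearn):

$ python3.11 -c "import sklearn, matplotlib, numpy, scipy, pandas, statsmodels"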
Use the ‘Linnerud’ example dataset from Scikit-learn. This contains physical exercise and physiological data as measured on 20 volunteers from a fitness club:
from sklearn import datasets
# Load the dataset
linnerud = datasets.load_linnerud(as_frame=True)
Only two of these variables will be used in this example: the number of sit-ups completed and waist circumference in inches:
x = linnerud['data']['Situps']
y = linnerud['target']['Waist']
Use graphs and descriptive statistics to make an initial decision as to what summary model to use:
import matplotlib.pyplot as plt
# Make figures A6 in size
A = 6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
# Image quality
plt.rc('figure', dpi=141)
# Be able to add LaTeX
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')
# Plot
plt.scatter(x, y, c='k', s=10)
plt.title('Linnerud Dataset')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')
Descriptive statistics:
import numpy as np
# Sample size
n = x.shape[0]
# Mean
y_bar = y.mean()
# Sample standard deviation
s = y.std(ddof=1)
# Standard error of the mean
sem = s / np.sqrt(n)
print(f'n = {n}; mean = {y_bar:.1f} inches, std dev = {s:.2f} inches, std error of the mean = {sem:.3f}')
## n = 20; mean = 35.4 inches, std dev = 3.20 inches, std error of the mean = 0.716
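As a sanity check, SciPy's sem() function (which also uses ddof=1 by default) should return the same standard error of the mean:

from scipy import stats

print(f'{stats.sem(y):.3f}')
## 0.716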
Calculate the 95% confidence interval:
from scipy import stats
# Significance level
alpha = 0.05
# Degrees of freedom
dof = n - 1
# Two-tailed test
tails = 2
# Percent-point function (aka quantile function) of the t-distribution
t = stats.t.ppf(1 - (alpha / tails), dof)
# Intervals
upper_ci = y_bar + t * sem
lower_ci = y_bar - t * sem
print(f'The sample mean is {y_bar:.1f} inches, 95% CI [{lower_ci:.1f}, {upper_ci:.1f}]')
## The sample mean is 35.4 inches, 95% CI [33.9, 36.9]
You can be 95% confident that the population mean falls between 33.9 and 36.9 inches.
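The same interval can be obtained more directly with the t-distribution's interval() method (note that in older versions of SciPy this method's first parameter is called alpha rather than confidence):

lower_ci, upper_ci = stats.t.interval(confidence=0.95, df=dof, loc=y_bar, scale=sem)

print(f'95% CI [{lower_ci:.1f}, {upper_ci:.1f}]')
## 95% CI [33.9, 36.9]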
Having a look at the data, it seems that a quadratic regression model (a parabolic line) might be appropriate here. Our alternative model is thus a quadratic relationship between someone's waist circumference and the number of sit-ups they can complete, whereas the null model is a linear relationship. Both of these models can be written in the form 'Data = Pattern + Residual' as follows:

Null model: y = α + βx + ε
Alternative model: y = α + βx + γx² + ε

where y is the 'data', i.e. the dependent variable (in this case, waist circumference), ε is the unexplained 'residual' (AKA 'noise') and α + βx and α + βx + γx² are the possible 'patterns' (α, β and γ being the coefficients of the polynomial line-of-best-fit). The hypotheses associated with these models are:

H₀: γ = 0; the quadratic term adds nothing and the straight-line model is sufficient
H₁: γ ≠ 0; the quadratic term is needed to describe the relationship
Here’s what the models look like graphically:
The null model states that there is a straight-line relationship y = α + βx + ε (waist circumference = number of sit-ups + unexplained residual) between the response variable y and the explanatory variable x:
m, c, r, p, se = stats.linregress(x, y)
plt.scatter(x, y, c='k', s=10)
xlim = plt.xlim()
plt.plot(np.array(xlim), m * np.array(xlim) + c, c='lightblue', ls='--', ms=5)
plt.xlim(xlim)
plt.title('Null Model')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')
The alternative model states that there is a quadratic relationship between sit-ups completed and waist circumference: y = α + βx + γx² + ε (waist circumference = sit-ups completed + (sit-ups completed)² + unexplained residual):
params, residuals, rank, singular_values, rcond = np.polyfit(x, y, 2, full=True)
plt.scatter(x, y, c='k', s=10)
xlim = plt.xlim()
x_fitted = np.linspace(xlim[0], xlim[1], 100)
y_fitted = params[2] + x_fitted * params[1] + x_fitted**2 * params[0]
plt.plot(x_fitted, y_fitted, c='lightblue', ls='--', ms=5)
plt.xlim(xlim)
plt.title('Alternative Model')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')
We now need to find values for the parameters α, β and γ:
from scipy import odr, stats
# Orthogonal distance regression
data = odr.Data(x, y)
odr_obj = odr.ODR(data, odr.quadratic)
output = odr_obj.run()
# Sample size
n = x.shape[0]
# Number of parameters
n_params = 3 # We have 3 parameters: α, β and γ
# Degrees of freedom
dof = n - n_params # Sample size - number of parameters
# Confidence level
alpha = 0.05
# Two-sided test
tails = 2
# Percent-point function (aka quantile function) of the t-distribution
t_critical = stats.t.ppf(1 - alpha / tails, dof)
# Margin of error
d = t_critical * output.sd_beta
# Confidence interval
ci_upper = output.beta + d
ci_lower = output.beta - d
# Define the parameters
symbols = ['γ', 'β', 'α']
for i in range(n_params):
    s = symbols[i]
    v = output.beta[i]
    se = output.sd_beta[i]
    margin = d[i]
    cu = ci_upper[i]
    cl = ci_lower[i]
    print(f'{s} = {v:7.4f} ± {margin:.4f}, 95% CI [{cl:7.4f} to {cu:7.4f}]; se = {se:.4f}')
## γ = 0.0003 ± 0.0004, 95% CI [-0.0000 to 0.0007]; se = 0.0002
## β = -0.1389 ± 0.1123, 95% CI [-0.2513 to -0.0266]; se = 0.0532
## α = 46.9428 ± 7.6021, 95% CI [39.3407 to 54.5449]; se = 3.6032
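Incidentally, the ODR output object has a pprint() method which prints a quick summary of the fit: the estimated parameters, their standard errors and covariance, and the residual variance:

# Pretty-print a summary of the regression results
output.pprint()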
Plot:
γ = output.beta[0]
β = output.beta[1]
α = output.beta[2]
# Create the plot
plt.scatter(x, y, c='k', s=10, label='Raw data')
# Quadratic line of best fit
xlim = plt.xlim()
x2 = np.linspace(xlim[0], xlim[1], 100)
plt.plot(x2, α + β * x2 + γ * x2**2, label='Quadratic line-of-best-fit')
# Title and labels
plt.title('Quadratic Regression')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')
# Finished
plt.legend(fontsize=8)
plt.xlim(xlim)
plt.show()
Alternatively, the parameters can be estimated with NumPy's polyfit() function. Note that this fits via ordinary least squares rather than orthogonal distance regression, which is why the values differ slightly from those above:

x = linnerud['data']['Situps']
y = linnerud['target']['Waist']
# Fit a polynomial of degree 2 (a quadratic)
params, residuals, rank, singular_values, rcond = np.polyfit(x, y, 2, full=True)
# Define the parameters
symbols = ['γ', 'β', 'α']
for i in range(n_params):
    s = symbols[i]
    v = params[i]
    print(f'{s} = {v:7.4f}')
## γ = 0.0003
## β = -0.1376
## α = 46.8531
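As an aside, NumPy's documentation now recommends the numpy.polynomial module over np.polyfit. Here is a minimal sketch of the same fit using Polynomial.fit, where convert() maps the internally-scaled coefficients back onto the original domain and returns them in ascending order:

from numpy.polynomial import Polynomial

# Fit a degree-2 polynomial; convert() undoes the internal domain scaling
α, β, γ = Polynomial.fit(x, y, 2).convert().coef
print(f'α = {α:.4f}, β = {β:.4f}, γ = {γ:.4f}')
## α = 46.8531, β = -0.1376, γ = 0.0003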
# Construct the fitted polynomial from the parameters
model = np.poly1d(params)
# Evaluate it at the measured x-values
y_model = model(x)
y_bar = np.mean(y)
# R² = explained sum of squares / total sum of squares
r_squared = np.sum((y_model - y_bar)**2) / np.sum((y - y_bar)**2)
print(f'Coefficient of determination, R² = {r_squared:.3f}')
## Coefficient of determination, R² = 0.527
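Given n = 20 observations and 2 predictor terms (x and x²), the adjusted R² follows from the usual formula and should match the value Statsmodels reports below:

# Number of predictor terms in the model
k = 2
# Adjusted coefficient of determination
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(f'Adjusted R² = {adj_r_squared:.3f}')
## Adjusted R² = 0.471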
The Statsmodels method requires some re-formatting of the data: the dependent and independent variables need to be in a single data frame, together with an explicit column for the squared term. Here's the re-formatting:
import pandas as pd
# Merge on indexes
df = pd.merge(y, x, left_index=True, right_index=True)
linnerud['data']['Situps_sq'] = linnerud['data']['Situps']**2
df = pd.merge(df, linnerud['data']['Situps_sq'], left_index=True, right_index=True)
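As a quick check that the merging worked, the data frame should now contain all three columns:

print(df.columns.tolist())
## ['Waist', 'Situps', 'Situps_sq']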
Now we can fit the quadratic regression model using ordinary least-squares (ols):
import statsmodels.formula.api as smf
model = smf.ols(formula='Waist ~ Situps + Situps_sq', data=df)
results = model.fit()
print(results.summary())
##                             OLS Regression Results
## ==============================================================================
## Dep. Variable:                  Waist   R-squared:                       0.527
## Model:                            OLS   Adj. R-squared:                  0.471
## Method:                 Least Squares   F-statistic:                     9.454
## Date:                Wed, 05 Jul 2023   Prob (F-statistic):            0.00174
## Time:                        03:37:07   Log-Likelihood:                -43.664
## No. Observations:                  20   AIC:                             93.33
## Df Residuals:                      17   BIC:                             96.31
## Df Model:                           2
## Covariance Type:            nonrobust
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept     46.8531      3.608     12.987      0.000      39.242      54.465
## Situps        -0.1376      0.053     -2.579      0.020      -0.250      -0.025
## Situps_sq      0.0003      0.000      1.985      0.063   -2.16e-05       0.001
## ==============================================================================
## Omnibus:                        2.212   Durbin-Watson:                   2.360
## Prob(Omnibus):                  0.331   Jarque-Bera (JB):                1.457
## Skew:                           0.658   Prob(JB):                        0.483
## Kurtosis:                       2.866   Cond. No.                     2.16e+05
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 2.16e+05. This might indicate that there are
## strong multicollinearity or other numerical problems.
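As an aside, manually creating the Situps_sq column could have been avoided: the formulas that Statsmodels uses (via the patsy package) support inline transformations with I(). Here's a sketch, using the illustrative names model_alt and results_alt so as not to overwrite the results object used below:

# Square the predictor inside the formula itself
model_alt = smf.ols(formula='Waist ~ Situps + I(Situps**2)', data=df)
results_alt = model_alt.fit()

The fitted model is identical, although the squared term is then reported under the name 'I(Situps ** 2)' instead of 'Situps_sq'.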
Specifically, here are the p-values for the model’s coefficients:
print(results.pvalues)
## Intercept 2.974737e-10
## Situps 1.950166e-02
## Situps_sq 6.346791e-02
## dtype: float64
We are only interested in the p-value for ‘Situps_sq’ (the gamma parameter, γ) when testing the null and alternative hypotheses given above (γ = 0 vs γ ≠ 0):
p = results.pvalues['Situps_sq']
print(f'p = {p:.3f}')
## p = 0.063
This is statistically significant at the 0.1 significance level, hence at that level we reject H₀ and conclude that the quadratic regression model represents the relationship between sit-ups completed and waist circumference better than the straight-line model does (p < 0.1). Note, however, that the result would not be significant at the stricter 0.05 level.
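As a final cross-check (a sketch, not part of the original analysis), the null and alternative models can be compared directly with an extra-sum-of-squares F-test using Statsmodels' anova_lm() function. Since the two models differ by a single term, the p-value of this F-test should equal the t-test p-value found above (0.063):

from statsmodels.stats.anova import anova_lm

# Fit the null (linear) and alternative (quadratic) models
null_model = smf.ols('Waist ~ Situps', data=df).fit()
alt_model = smf.ols('Waist ~ Situps + Situps_sq', data=df).fit()
# Compare the nested models with an F-test
print(anova_lm(null_model, alt_model))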