⇦ Back

Quadratic regression fits a parabolic line-of-best-fit to continuous numerical data that has one dependent variable (it is a ‘univariate’ linear model), one independent variable and one group. In other words:

If the above criteria are met then, as we can see from the flowchart below, quadratic regression is needed.

Python Packages:

The code on this page uses the sklearn, matplotlib, numpy, scipy, pandas and statsmodels packages. These can be installed from the terminal with:

$ python3.11 -m pip install sklearn
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install pandas
$ python3.11 -m pip install statsmodels

where python3.11 corresponds to the version of Python you have installed and are using.

1 Example Data

Use the ‘Linnerud’ example dataset from Scikit-learn. This contains physical exercise and physiological data as measured on 20 volunteers from a fitness club:

from sklearn import datasets

# Load the dataset
linnerud = datasets.load_linnerud(as_frame=True)

Only two of these variables will be used in this example: the number of situps completed and waist circumference in inches:

x = linnerud['data']['Situps']
y = linnerud['target']['Waist']

2 Exploring the Data

Use graphs and descriptive statistics to make an initial decision as to what summary model to use:

import matplotlib.pyplot as plt

# Make figures A6 in size
A = 6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
# Image quality
plt.rc('figure', dpi=141)
# Be able to add Latex
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')

# Plot
plt.scatter(x, y, c='k', s=10)
plt.title('Linnerud Dataset')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')

Descriptive statistics:

import numpy as np

# Sample size
n = x.shape[0]
# Mean
y_bar = y.mean()
# Sample standard deviation
s = y.std(ddof=1)
# Standard error of the mean
sem = s / np.sqrt(n)

print(f'n = {n}; mean = {y_bar:.1f} inches, std dev = {s:.2f} inches, std error of the mean = {sem:.3f}')
## n = 20; mean = 35.4 inches, std dev = 3.20 inches, std error of the mean = 0.716

Calculate the 95% confidence interval:

from scipy import stats

# Significance level
alpha = 0.05
# Degrees of freedom
dof = n - 1
# Two-tailed test
tails = 2
# Percent-point function (aka quantile function) of the t-distribution
t = stats.t.ppf(1 - (alpha / tails), dof)
# Intervals
upper_ci = y_bar + t * sem
lower_ci = y_bar - t * sem

print(f'The sample mean is {y_bar:.1f} inches, 95% CI [{lower_ci:.1f}, {upper_ci:.1f}]')
## The sample mean is 35.4 inches, 95% CI [33.9, 36.9]

You can be 95% confident that the population mean falls between 33.9 and 36.9 inches.

3 Defining the Model

Having a look at the data, it seems that a quadratic regression model (a parabolic line) might be appropriate here. Our alternative model is thus a quadratic relationship between someone’s waist circumference and the number of sit-ups they can complete, whereas the null model is a linear relationship. Both of these models can be written in the form ‘Data = Pattern + Residual’ as follows:

  • Null model: y = α + βx + ε
  • Alternative model: y = α + βx + γx² + ε

where y is the ‘data’ ie the dependent variable (in this case, waist circumference), ε is the unexplained ‘residual’ (AKA ‘noise’) and α + βx and α + βx + γx² are the possible ‘patterns’ (α, β and γ being the coefficients of the polynomial line-of-best-fit). The hypotheses associated with these models are:

  • Null hypothesis, H₀: γ = 0
  • Alternative hypothesis, H₁: γ ≠ 0

Here’s what the models look like graphically:

3.1 Null Model

There is a straight-line relationship y = α + βx + ε (waist circumference = number of sit-ups + unexplained residual) between the response variable y and explanatory variable x:

m, c, r, p, se = stats.linregress(x, y)

plt.scatter(x, y, c='k', s=10)
xlim = plt.xlim()
plt.plot(np.array(xlim), m * np.array(xlim) + c, c='lightblue', ls='--', ms=5)
plt.xlim(xlim)
plt.title('Null Model')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')

3.2 Alternative Model

The alternative model states that there is a quadratic relationship between sit-ups completed and waist circumference: y = α + βx + γx² + ε (waist circumference = sit-ups completed + (sit-ups completed)² + unexplained residual):

params, residuals, rank, singular_values, rcond = np.polyfit(x, y, 2, full=True)

plt.scatter(x, y, c='k', s=10)
xlim = plt.xlim()
x_fitted = np.linspace(xlim[0], xlim[1], 100)
y_fitted = params[2] + x_fitted * params[1] + x_fitted**2 * params[0]
plt.plot(x_fitted, y_fitted, c='lightblue', ls='--', ms=5)
plt.xlim(xlim)
plt.title('Alternative Model')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')

4 Fitting the Model

We now need to find values for the parameters α, β and γ:

4.1 Using SciPy

from scipy import odr, stats

# Orthogonal distance regression
data = odr.Data(x, y)
odr_obj = odr.ODR(data, odr.quadratic)
output = odr_obj.run()
# Sample size
n = x.shape[0]
# Number of parameters
n_params = 3  # We have 3 parameters: α, β and γ
# Degrees of freedom
dof = n - n_params  # Sample size - number of parameters
# Confidence level
alpha = 0.05
# Two-sided test
tails = 2
# Percent-point function (aka quantile function) of the t-distribution
t_critical = stats.t.ppf(1 - alpha / tails, dof)
# Margin of error
d = t_critical * output.sd_beta
# Confidence interval
ci_upper = output.beta + d
ci_lower = output.beta - d
# Define the parameters
symbols = ['γ', 'β', 'α']

for i in range(n_params):
    s = symbols[i]
    v = output.beta[i]
    se = output.sd_beta[i]
    margin = d[i]
    cu = ci_upper[i]
    cl = ci_lower[i]
    print(f'{s} = {v:7.4f} ± {margin:.4f}, 95% CI [{cl:7.04f} to {cu:7.4f}]; se = {se:.4f}')
## γ =  0.0003 ± 0.0004, 95% CI [-0.0000 to  0.0007]; se = 0.0002
## β = -0.1389 ± 0.1123, 95% CI [-0.2513 to -0.0266]; se = 0.0532
## α = 46.9428 ± 7.6021, 95% CI [39.3407 to 54.5449]; se = 3.6032

Plot:

γ = output.beta[0]
β = output.beta[1]
α = output.beta[2]

# Create the plot
plt.scatter(x, y, c='k', s=10, label='Raw data')
# Quadratic line of best fit
xlim = plt.xlim()
x2 = np.linspace(xlim[0], xlim[1], 100)
plt.plot(x2, α + β * x2 + γ * x2**2, label='Quadratic line-of-best-fit')
# Title and labels
plt.title('Quadratic Regression')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')
# Finished
plt.legend(fontsize=8)
plt.xlim(xlim)
plt.show()

4.2 Using NumPy

x = linnerud['data']['Situps']
y = linnerud['target']['Waist']

# Fit a polynomial of degree 2 (a quadratic)
params, residuals, rank, singular_values, rcond = np.polyfit(x, y, 2, full=True)

# Define the parameters
symbols = ['γ', 'β', 'α']
for i in range(n_params):
    s = symbols[i]
    v = params[i]
    print(f'{s} = {v:7.4f}')
## γ =  0.0003
## β = -0.1376
## α = 46.8531
# Fit the model
model = np.poly1d(params)
y_model = model(x)
y_bar = np.mean(y)
r_squared = np.sum((y_model - y_bar)**2) / np.sum((y - y_bar)**2)

print(f'Coefficient of determination, R² = {r_squared:.3f}')
## Coefficient of determination, R² = 0.527

4.3 Using Statsmodels

The Statsmodels method requires some re-formatting of the data:

  • Everything must be in one data frame
  • In addition to ‘sit-ups’, there needs to be a column with the number of sit-ups squared. This will form the γx² component of the quadratic polynomial

Here’s the data re-formatting:

import pandas as pd

# Merge on indexes
df = pd.merge(y, x, left_index=True, right_index=True)
linnerud['data']['Situps_sq'] = linnerud['data']['Situps']**2
df = pd.merge(df, linnerud['data']['Situps_sq'], left_index=True, right_index=True)

Now we can fit the quadratic regression model using ordinary least-squares (ols):

import statsmodels.formula.api as smf

model = smf.ols(formula='Waist ~ Situps + Situps_sq', data=df)
results = model.fit()

print(results.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                  Waist   R-squared:                       0.527
## Model:                            OLS   Adj. R-squared:                  0.471
## Method:                 Least Squares   F-statistic:                     9.454
## Date:                Wed, 05 Jul 2023   Prob (F-statistic):            0.00174
## Time:                        03:37:07   Log-Likelihood:                -43.664
## No. Observations:                  20   AIC:                             93.33
## Df Residuals:                      17   BIC:                             96.31
## Df Model:                           2                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept     46.8531      3.608     12.987      0.000      39.242      54.465
## Situps        -0.1376      0.053     -2.579      0.020      -0.250      -0.025
## Situps_sq      0.0003      0.000      1.985      0.063   -2.16e-05       0.001
## ==============================================================================
## Omnibus:                        2.212   Durbin-Watson:                   2.360
## Prob(Omnibus):                  0.331   Jarque-Bera (JB):                1.457
## Skew:                           0.658   Prob(JB):                        0.483
## Kurtosis:                       2.866   Cond. No.                     2.16e+05
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 2.16e+05. This might indicate that there are
## strong multicollinearity or other numerical problems.

Specifically, here are the p-values for the model’s coefficients:

print(results.pvalues)
## Intercept    2.974737e-10
## Situps       1.950166e-02
## Situps_sq    6.346791e-02
## dtype: float64

We are only interested in the p-value for ‘Situps_sq’ (the gamma parameter, γ) when testing the null and alternative hypotheses given above (γ = 0 vs γ ≠ 0):

p = results.pvalues['Situps_sq']

print(f'p = {p:.3f}')
## p = 0.063

This is statistically significant at the 0.1 significance level, hence we fail to reject H₀ and conclude that the quadratic regression model adequately represents the relationship between sit-ups completed and waist circumference (p < 0.1).

⇦ Back