Quadratic regression fits a parabolic curve of best fit to continuous numerical data that has one dependent variable (making it a 'univariate' model), one independent variable and one group. Note that it still counts as a linear model, despite the curved fit, because it is linear in its parameters. If these criteria are met by your data then quadratic regression is an appropriate choice.
Python Packages:
The code on this page uses the scikit-learn, matplotlib, numpy, scipy, pandas and statsmodels packages. These can be installed from the terminal with:
$ python3.11 -m pip install scikit-learn
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install pandas
$ python3.11 -m pip install statsmodels
where python3.11 corresponds to the version of Python you have installed and are using.
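To quickly check that everything installed correctly, one option is to try importing all six packages in one go (note that, although the package is distributed as scikit-learn, it is imported under the name sklearn):

$ python3.11 -c "import sklearn, matplotlib, numpy, scipy, pandas, statsmodels"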
Use the ‘Linnerud’ example dataset from Scikit-learn. This contains physical exercise and physiological data as measured on 20 volunteers from a fitness club:
from sklearn import datasets
# Load the dataset
linnerud = datasets.load_linnerud(as_frame=True)
Only two of these variables will be used in this example: the number of sit-ups completed and waist circumference in inches:
x = linnerud['data']['Situps']
y = linnerud['target']['Waist']
Use graphs and descriptive statistics to make an initial decision as to what summary model to use:
import matplotlib.pyplot as plt
# Make figures A6 in size
A = 6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
# Image quality
plt.rc('figure', dpi=141)
# Be able to add LaTeX
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')
# Plot
plt.scatter(x, y, c='k', s=10)
plt.title('Linnerud Dataset')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')
Descriptive statistics:
import numpy as np
# Sample size
n = x.shape[0]
# Mean
y_bar = y.mean()
# Sample standard deviation
s = y.std(ddof=1)
# Standard error of the mean
sem = s / np.sqrt(n)
print(f'n = {n}; mean = {y_bar:.1f} inches, std dev = {s:.2f} inches, std error of the mean = {sem:.3f}')
## n = 20; mean = 35.4 inches, std dev = 3.20 inches, std error of the mean = 0.716
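As a sanity check, SciPy's sem() function (which also uses ddof=1 by default) should return the same standard error of the mean:

from scipy import stats

print(f'{stats.sem(y):.3f}')
## 0.716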
Calculate the 95% confidence interval:
from scipy import stats
# Significance level
alpha = 0.05
# Degrees of freedom
dof = n - 1
# Two-tailed test
tails = 2
# Percent-point function (aka quantile function) of the t-distribution
t = stats.t.ppf(1 - (alpha / tails), dof)
# Intervals
upper_ci = y_bar + t * sem
lower_ci = y_bar - t * sem
print(f'The sample mean is {y_bar:.1f} inches, 95% CI [{lower_ci:.1f}, {upper_ci:.1f}]')
## The sample mean is 35.4 inches, 95% CI [33.9, 36.9]
You can be 95% confident that the population mean falls between 33.9 and 36.9 inches.
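The same interval can be obtained more directly with the t-distribution's interval() method (note that in older versions of SciPy this method's first parameter is called alpha rather than confidence):

lower_ci, upper_ci = stats.t.interval(confidence=0.95, df=dof, loc=y_bar, scale=sem)

print(f'95% CI [{lower_ci:.1f}, {upper_ci:.1f}]')
## 95% CI [33.9, 36.9]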
Having a look at the data, it seems that a quadratic regression model (a parabolic line) might be appropriate here. Our alternative model is thus a quadratic relationship between someone's waist circumference and the number of sit-ups they can complete, whereas the null model is a linear relationship. Both of these models can be written in the form 'Data = Pattern + Residual' as follows:

Null model: y = α + βx + ε
Alternative model: y = α + βx + γx² + ε

where y is the 'data', i.e. the dependent variable (in this case, waist circumference), ε is the unexplained 'residual' (AKA 'noise') and α + βx and α + βx + γx² are the possible 'patterns' (α, β and γ being the coefficients of the polynomial line-of-best-fit). The hypotheses associated with these models are:

H₀: γ = 0; the quadratic term adds nothing and the straight-line model is sufficient
H₁: γ ≠ 0; the quadratic term is needed to describe the relationship
Here’s what the models look like graphically:
The null model states that there is a straight-line relationship y = α + βx + ε (waist circumference = number of sit-ups + unexplained residual) between the response variable y and the explanatory variable x:
m, c, r, p, se = stats.linregress(x, y)
plt.scatter(x, y, c='k', s=10)
xlim = plt.xlim()
plt.plot(np.array(xlim), m * np.array(xlim) + c, c='lightblue', ls='--', ms=5)
plt.xlim(xlim)
plt.title('Null Model')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')
The alternative model states that there is a quadratic relationship between sit-ups completed and waist circumference: y = α + βx + γx² + ε (waist circumference = sit-ups completed + (sit-ups completed)² + unexplained residual):
params, residuals, rank, singular_values, rcond = np.polyfit(x, y, 2, full=True)
plt.scatter(x, y, c='k', s=10)
xlim = plt.xlim()
x_fitted = np.linspace(xlim[0], xlim[1], 100)
y_fitted = params[2] + x_fitted * params[1] + x_fitted**2 * params[0]
plt.plot(x_fitted, y_fitted, c='lightblue', ls='--', ms=5)
plt.xlim(xlim)
plt.title('Alternative Model')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')
We now need to find values for the parameters α, β and γ:
from scipy import odr, stats
# Orthogonal distance regression
data = odr.Data(x, y)
odr_obj = odr.ODR(data, odr.quadratic)
output = odr_obj.run()
# Sample size
n = x.shape[0]
# Number of parameters
n_params = 3 # We have 3 parameters: α, β and γ
# Degrees of freedom
dof = n - n_params # Sample size - number of parameters
# Confidence level
alpha = 0.05
# Two-sided test
tails = 2
# Percent-point function (aka quantile function) of the t-distribution
t_critical = stats.t.ppf(1 - alpha / tails, dof)
# Margin of error
d = t_critical * output.sd_beta
# Confidence interval
ci_upper = output.beta + d
ci_lower = output.beta - d
# Define the parameters
symbols = ['γ', 'β', 'α']
for i in range(n_params):
    s = symbols[i]
    v = output.beta[i]
    se = output.sd_beta[i]
    margin = d[i]
    cu = ci_upper[i]
    cl = ci_lower[i]
    print(f'{s} = {v:7.4f} ± {margin:.4f}, 95% CI [{cl:7.4f} to {cu:7.4f}]; se = {se:.4f}')
## γ = 0.0003 ± 0.0004, 95% CI [-0.0000 to 0.0007]; se = 0.0002
## β = -0.1389 ± 0.1123, 95% CI [-0.2513 to -0.0266]; se = 0.0532
## α = 46.9428 ± 7.6021, 95% CI [39.3407 to 54.5449]; se = 3.6032
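Incidentally, the ODR output object has a pprint() method which prints a quick summary of the fit: the estimated parameters, their standard errors and covariance, and the residual variance:

# Pretty-print a summary of the regression results
output.pprint()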
Plot:
γ = output.beta[0]
β = output.beta[1]
α = output.beta[2]
# Create the plot
plt.scatter(x, y, c='k', s=10, label='Raw data')
# Quadratic line of best fit
xlim = plt.xlim()
x2 = np.linspace(xlim[0], xlim[1], 100)
plt.plot(x2, α + β * x2 + γ * x2**2, label='Quadratic line-of-best-fit')
# Title and labels
plt.title('Quadratic Regression')
plt.xlabel('Sit-Ups')
plt.ylabel('Waist Circumference [in]')
# Finished
plt.legend(fontsize=8)
plt.xlim(xlim)
plt.show()
Alternatively, the parameters can be estimated with NumPy's polyfit() function. Note that this fits via ordinary least squares rather than orthogonal distance regression, which is why the values differ slightly from those above:

x = linnerud['data']['Situps']
y = linnerud['target']['Waist']
# Fit a polynomial of degree 2 (a quadratic)
params, residuals, rank, singular_values, rcond = np.polyfit(x, y, 2, full=True)
# Define the parameters
symbols = ['γ', 'β', 'α']
for i in range(n_params):
    s = symbols[i]
    v = params[i]
    print(f'{s} = {v:7.4f}')
## γ = 0.0003
## β = -0.1376
## α = 46.8531
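As an aside, NumPy's documentation now recommends the numpy.polynomial module over np.polyfit. Here is a minimal sketch of the same fit using Polynomial.fit, where convert() maps the internally-scaled coefficients back onto the original domain and returns them in ascending order:

from numpy.polynomial import Polynomial

# Fit a degree-2 polynomial; convert() undoes the internal domain scaling
α, β, γ = Polynomial.fit(x, y, 2).convert().coef
print(f'α = {α:.4f}, β = {β:.4f}, γ = {γ:.4f}')
## α = 46.8531, β = -0.1376, γ = 0.0003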
# Construct the fitted polynomial from the parameters
model = np.poly1d(params)
# Evaluate it at the measured x-values
y_model = model(x)
y_bar = np.mean(y)
# R² = explained sum of squares / total sum of squares
r_squared = np.sum((y_model - y_bar)**2) / np.sum((y - y_bar)**2)
print(f'Coefficient of determination, R² = {r_squared:.3f}')
## Coefficient of determination, R² = 0.527
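Given n = 20 observations and 2 predictor terms (x and x²), the adjusted R² follows from the usual formula and should match the value Statsmodels reports below:

# Number of predictor terms in the model
k = 2
# Adjusted coefficient of determination
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(f'Adjusted R² = {adj_r_squared:.3f}')
## Adjusted R² = 0.471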
The Statsmodels method requires some re-formatting of the data: the dependent and independent variables need to be in a single data frame, together with an explicit column for the squared term. Here's the re-formatting:
import pandas as pd
# Merge on indexes
df = pd.merge(y, x, left_index=True, right_index=True)
linnerud['data']['Situps_sq'] = linnerud['data']['Situps']**2
df = pd.merge(df, linnerud['data']['Situps_sq'], left_index=True, right_index=True)
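As a quick check that the merging worked, the data frame should now contain all three columns:

print(df.columns.tolist())
## ['Waist', 'Situps', 'Situps_sq']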
Now we can fit the quadratic regression model using ordinary least-squares (ols):
import statsmodels.formula.api as smf
model = smf.ols(formula='Waist ~ Situps + Situps_sq', data=df)
results = model.fit()
print(results.summary())
##                             OLS Regression Results
## ==============================================================================
## Dep. Variable:                  Waist   R-squared:                       0.527
## Model:                            OLS   Adj. R-squared:                  0.471
## Method:                 Least Squares   F-statistic:                     9.454
## Date:                Wed, 05 Jul 2023   Prob (F-statistic):            0.00174
## Time:                        03:37:07   Log-Likelihood:                -43.664
## No. Observations:                  20   AIC:                             93.33
## Df Residuals:                      17   BIC:                             96.31
## Df Model:                           2
## Covariance Type:            nonrobust
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept     46.8531      3.608     12.987      0.000      39.242      54.465
## Situps        -0.1376      0.053     -2.579      0.020      -0.250      -0.025
## Situps_sq      0.0003      0.000      1.985      0.063   -2.16e-05       0.001
## ==============================================================================
## Omnibus:                        2.212   Durbin-Watson:                   2.360
## Prob(Omnibus):                  0.331   Jarque-Bera (JB):                1.457
## Skew:                           0.658   Prob(JB):                        0.483
## Kurtosis:                       2.866   Cond. No.                     2.16e+05
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 2.16e+05. This might indicate that there are
## strong multicollinearity or other numerical problems.
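As an aside, manually creating the Situps_sq column could have been avoided: the formulas that Statsmodels uses (via the patsy package) support inline transformations with I(). Here's a sketch, using the illustrative names model_alt and results_alt so as not to overwrite the results object used below:

# Square the predictor inside the formula itself
model_alt = smf.ols(formula='Waist ~ Situps + I(Situps**2)', data=df)
results_alt = model_alt.fit()

The fitted model is identical, although the squared term is then reported under the name 'I(Situps ** 2)' instead of 'Situps_sq'.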
Specifically, here are the p-values for the model’s coefficients:
print(results.pvalues)
## Intercept 2.974737e-10
## Situps 1.950166e-02
## Situps_sq 6.346791e-02
## dtype: float64
We are only interested in the p-value for ‘Situps_sq’ (the gamma parameter, γ) when testing the null and alternative hypotheses given above (γ = 0 vs γ ≠ 0):
p = results.pvalues['Situps_sq']
print(f'p = {p:.3f}')
## p = 0.063
This is statistically significant at the 0.1 significance level, hence at that level we reject H₀ and conclude that the quadratic regression model represents the relationship between sit-ups completed and waist circumference better than the straight-line model does (p < 0.1). Note, however, that the result would not be significant at the stricter 0.05 level.
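As a final cross-check (a sketch, not part of the original analysis), the null and alternative models can be compared directly with an extra-sum-of-squares F-test using Statsmodels' anova_lm() function. Since the two models differ by a single term, the p-value of this F-test should equal the t-test p-value found above (0.063):

from statsmodels.stats.anova import anova_lm

# Fit the null (linear) and alternative (quadratic) models
null_model = smf.ols('Waist ~ Situps', data=df).fit()
alt_model = smf.ols('Waist ~ Situps + Situps_sq', data=df).fit()
# Compare the nested models with an F-test
print(anova_lm(null_model, alt_model))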