⇦ Back

Simple linear regression fits a straight, slanted line-of-best-fit to continuous numerical data that has one dependent variable (it is a ‘univariate’ linear model), one independent variable and one group.

If the above criteria are met then, as we can see from the flowchart below, simple linear regression is appropriate:

Python Packages:

The code on this page uses the sklearn, matplotlib, numpy, scipy, pandas and statsmodels packages. These can be installed from the terminal with:

$ python3.11 -m pip install scikit-learn
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
$ python3.11 -m pip install pandas
$ python3.11 -m pip install statsmodels

where python3.11 corresponds to the version of Python you have installed and are using.

1 Example Data

We will use the ‘diabetes’ example dataset from Scikit-learn. This contains measurements from 442 patients including 10 health-related variables and one ‘target’ variable: a quantitative measure of disease progression one year after baseline. For this example we will only use one of the 10 variables - body mass index (BMI) - and see if it alone can predict the target (‘outcome’) variable:

from sklearn import datasets

# Load the dataset as a 'bunch' object that contains a data frame
diabetes = datasets.load_diabetes(as_frame=True)
# Extract the data frame
df = diabetes['frame']
# We'll look at 2 columns:
# 'bmi': the body mass index
# 'target': a quantitative measure of disease progression one year after baseline
df = df[['bmi', 'target']]
# Undo the scaling factor that was applied
df['bmi'] = (df['bmi'] * 3.19 / df['bmi'].std(ddof=1)) + 24.4

print(df.head())
##          bmi  target
## 0  28.533029   151.0
## 1  20.951753    75.0
## 2  27.377787   141.0
## 3  23.623250   206.0
## 4  21.962589   135.0

2 Exploring the Data

Use graphs and descriptive statistics to make an initial decision as to what summary model to use:

2.1 Set the Format for the Plots

from matplotlib import pyplot as plt

# Make figures A6 in size
A = 6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
# Be able to add Latex
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')

2.2 Plot the Data

# Plot
plt.scatter(df['bmi'], df['target'], c='k', s=6)
plt.title('Diabetes Dataset')
plt.xlabel('BMI [kg/m²]')
plt.ylabel('Target Variable (Disease Progression)')

2.3 Descriptive Statistics

import numpy as np

# Sample size
n = df.shape[0]
# Mean
y_bar = df['target'].mean()
# Sample standard deviation
s = df['target'].std(ddof=1)
# Standard error of the mean
sem = s / np.sqrt(n)

print(f'n = {n}; mean = {y_bar:.1f}, std dev = {s:.2f}, std error of the mean = {sem:.2f}')
## n = 442; mean = 152.1, std dev = 77.09, std error of the mean = 3.67

Calculate the 95% confidence interval:

from scipy import stats

# Significance level
alpha = 0.05
# Degrees of freedom
dof = n - 1
# Two-tailed test
tails = 2
# Percent-point function (aka quantile function) of the t-distribution
t_critical = stats.t.ppf(1 - (alpha / tails), dof)
# Margin of error
d = t_critical * sem
# Confidence interval
upper_ci = y_bar + d
lower_ci = y_bar - d

print(f'The sample mean is {y_bar:.1f}, 95% CI [{lower_ci:.1f}, {upper_ci:.1f}].')
## The sample mean is 152.1, 95% CI [144.9, 159.3].

You can be 95% confident that the population mean falls between 144.9 and 159.3.

3 Defining the Model

Having a look at the data, it seems that a linear regression model (a straight, slanted line) might be appropriate here. Our alternative model is thus a linear relationship between the true change in disease progression and an increase in a patient’s BMI, whereas the null model has no such relationship and disease progression is merely equal to the average (mean) disease progression of the sample. Both of these models can be written in the form ‘Data = Pattern + Residual’ as follows:

  • Null model: \(y = \alpha + \epsilon\)
  • Alternative model: \(y = \alpha + \beta x + \epsilon\)

where \(y\) is the ‘data’ ie the dependent variable (in this case, disease progression), \(\epsilon\) is the unexplained ‘residual’ (AKA ‘noise’) and \(\alpha\) and \(\alpha + \beta x\) are the possible ‘patterns’: \(\alpha\) is the mean disease progression under the null model and \(\alpha + \beta x\) is a line-of-best-fit with gradient \(\beta\) and y-intercept \(\alpha\) under the alternative model. Note that, effectively, the sample mean is the null model. The hypotheses associated with these models are:

  • Null hypothesis, \(H_0: \beta = 0\)
  • Alternative hypothesis, \(H_1: \beta \neq 0\)

Here’s what the models look like graphically:

3.1 Null Model

The null model states that there is no relationship between the response variable \(y\) and explanatory variable \(x\), ie \(y = \alpha + \epsilon\) (or, equivalently, \(y = \alpha + \beta x + \epsilon\) with \(\beta = 0\)) where \(\alpha\) is the sample mean. In other words, disease progression = average disease progression + unexplained residual.

plt.scatter(df['bmi'], df['target'], c='k', s=6)
mean = df['target'].mean()
plt.axhline(mean, c='lightblue', ls='--', label=f'Mean = {mean:.1f}')
plt.title('Null Model')
plt.xlabel('BMI [kg/m²]')
plt.ylabel('Target Variable (Disease Progression)')
plt.legend(frameon=False, fontsize=9)
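
As a quick check of the statement that the sample mean is effectively the null model, here is a minimal sketch (assuming Statsmodels is installed as above) that fits the intercept-only model and confirms that the fitted value of \(\alpha\) is the sample mean of 152.1:

import statsmodels.api as sm

# Fit the intercept-only (null) model: target ~ 1
null_results = sm.OLS.from_formula('target ~ 1', df).fit()

# The single fitted parameter is the sample mean
print(f"α = {null_results.params['Intercept']:.1f}")
## α = 152.1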

3.2 Alternative Model

The alternative model states that there is a relationship between BMI and disease progression, namely a straight-line regression \(y = \alpha + \beta x + \epsilon\) (disease progression = straight-line function of BMI + unexplained residual):

m, c, r, p, se = stats.linregress(df['bmi'], df['target'])

plt.scatter(df['bmi'], df['target'], c='k', s=6)
xlim = plt.xlim()
x = np.array(xlim)
y = m * x + c
plt.plot(x, y, c='lightblue', ls='--', label='Line of Best Fit')
plt.xlim(xlim)
plt.title('Alternative Model')
plt.xlabel('BMI [kg/m²]')
plt.ylabel('Target Variable (Disease Progression)')
plt.legend(frameon=False, fontsize=9)

4 Fitting the Model

Each time you perform a simple linear regression analysis you are comparing two competing summary models: the null model and the straight-line regression model. Indeed, all hypothesis tests are a choice between two models: a two-sample t-test, for example, is a choice between using one or two means as the model for two groups of data (one mean for both groups or one mean for each group).
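
To make this comparison concrete, here is a minimal sketch (assuming Statsmodels is installed as above) that fits both competing models and compares them directly with an F-test via anova_lm(); for a single-predictor regression the resulting F-statistic and p-value should match those in the ANOVA table in the next sub-section:

import statsmodels.api as sm

# Fit the null (intercept-only) model and the alternative (straight-line) model
null_results = sm.OLS.from_formula('target ~ 1', df).fit()
alt_results = sm.OLS.from_formula('target ~ bmi', df).fit()

# Compare the two nested models with an F-test
comparison = sm.stats.anova_lm(null_results, alt_results)

print(comparison)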

4.1 Test of Between-Subject Effects

For this, we can use the ordinary least-squares (OLS()) function from the Statsmodels package:

import statsmodels.api as sm

# Create and fit the model, then produce the ANOVA table
model = sm.OLS.from_formula('target ~ bmi', df)
results = model.fit()
aov = sm.stats.anova_lm(results, typ=2)

print(aov)
##                 sum_sq     df           F        PR(>F)
## bmi       9.014273e+05    1.0  230.653764  3.466006e-42
## Residual  1.719582e+06  440.0         NaN           NaN

The indices are the sources of the between-subject effects: ‘bmi’ is the explanatory variable (the regression term) and ‘Residual’ is the error. The PR(>F) column contains the p-value of, in this case, the test of the null hypothesis only (there is only one non-NaN value). The sum of the columns is useful if you’re interested in the total between-subject effects:

print(aov.sum())
## sum_sq    2.621009e+06
## df        4.410000e+02
## F         2.306538e+02
## PR(>F)    3.466006e-42
## dtype: float64

Here is the relevant p-value:

p = aov.loc['bmi', 'PR(>F)']

print(f'p-value = {p:.3e}')
## p-value = 3.466e-42

4.2 Parameter Estimates

4.2.1 Using Statsmodels

Estimate the parameters of the model and get the associated p-values, again using ordinary least-squares from Statsmodels:

import statsmodels.api as sm

# Create and fit the model
X = sm.add_constant(df['bmi'])
model = sm.OLS(df['target'], X)
results = model.fit()

print(f'Intercept:\n{results.t_test([1, 0])}\n\n\nGradient:\n{results.t_test([0, 1])}')
## Intercept:
##                              Test for Constraints                             
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## c0          -193.6826     22.963     -8.434      0.000    -238.814    -148.551
## ==============================================================================
## 
## 
## Gradient:
##                              Test for Constraints                             
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## c0            14.1728      0.933     15.187      0.000      12.339      16.007
## ==============================================================================

Final results of the simple linear regression:

α = results.params[0]
β = results.params[1]

print(f'target = {α:.2f} {β:=+8.3f} × bmi')
## target = -193.68 + 14.173 × bmi
t = results.tvalues[1]
p = results.pvalues[1]

print(f't = {t:.3f}, p = {p:.3e}')
## t = 15.187, p = 3.466e-42

As \(p \lt 0.05\), we reject \(H_0 \left( \beta = 0 \right)\) in favour of \(H_1 \left( \beta \neq 0 \right)\) at significance level 0.05.

4.2.2 Using SciPy

Here’s how to do it using the linregress() function from SciPy:

from scipy import stats

# Simple linear regression
result = stats.linregress(df['bmi'], df['target'])

# Extract the results
β = result.slope
α = result.intercept
r = result.rvalue
p = result.pvalue
se_β = result.stderr
se_α = result.intercept_stderr

Results of the simple linear regression:

print(f'target = {α:.2f} {β:=+8.3f} × bmi')
## target = -193.68 + 14.173 × bmi
print(f'p = {p:.3e}')
## p = 3.466e-42

As \(p \lt 0.05\), we reject \(H_0 \left( \beta = 0 \right)\) in favour of \(H_1 \left( \beta \neq 0 \right)\) at significance level 0.05.

The t-statistic is calculated internally by linregress() but is not returned; for the record, it can be recovered from the p-value:

t_stat = stats.t.ppf(1 - (p / 2), df.shape[0] - 2)

print(f't = {t_stat:.3f}')
## t = 15.187

4.2.3 Using Scikit-Learn

Scikit-learn as a package is geared towards machine learning so its method for doing simple linear regression involves ‘training’ a regression model. This example has been adapted from the one included in the documentation.

from sklearn import datasets, linear_model, metrics

# Separate out the data we need
X = df[['bmi']]
Y = df['target']

# Create a linear regression object
regr = linear_model.LinearRegression()

# Train a regression model using the raw data as a training set
ft = regr.fit(X, Y)

# Results of the simple linear regression
print(f'Intercept: {regr.intercept_:.2f}; coefficient: {regr.coef_[0]:.3f}')
## Intercept: -193.68; coefficient: 14.173

Use the predictions of the trained regression model:

predictions = regr.predict(X)

# The mean squared error
mse = metrics.mean_squared_error(Y, predictions)

print(f'Mean squared error: {mse:.2f}')
## Mean squared error: 3890.46
# The coefficient of determination
# (1 is perfect prediction)
R2 = metrics.r2_score(Y, predictions)

print(f'Coefficient of determination, R²: {R2:.3f}')
## Coefficient of determination, R²: 0.344

Calculate the p-values. This uses code taken from here:

from scipy import stats

# Re-format the results of the linear regression
params = np.append(regr.coef_, regr.intercept_)
# Sum of squared residuals
sum_sq_residuals = sum((Y - predictions)**2)
# Degrees of freedom (sample size - number of parameters)
dof = len(X) - len(params)
# Mean squared error
mse = sum_sq_residuals / dof
# Add a constant
X['Constant'] = 1
# Variances of the parameters (the diagonal of the covariance matrix; '@' is matrix multiplication)
cov = mse * np.diagonal(np.linalg.inv(X.T @ X))
# Standard errors of the parameters
se = np.sqrt(cov)
# t-statistics associated with the parameters
t_statistics = params / se
# Two-tailed test
tails = 2
# Calculate the p-values
p_values = [tails * (1 - stats.t.cdf(np.abs(t), dof)) for t in t_statistics]

Display the results:

import pandas as pd

# Round-off the results
params = np.round(params, 4)
se = np.round(se, 3)
t_statistics = np.round(t_statistics, 3)
p_values = np.round(p_values, 3)

# Construct the output data frame
output = pd.DataFrame()
output['Parameters'] = ['β', 'α']
output['Values'] = params
output['Standard Errors'] = se
output['t-Values'] = t_statistics
output['Probabilities'] = p_values

print(output)
##   Parameters    Values  Standard Errors  t-Values  Probabilities
## 0          β   14.1728            0.933    15.187            0.0
## 1          α -193.6826           22.963    -8.434            0.0

As \(p \lt 0.05\) for the gradient \(\beta\), we reject \(H_0 \left(\beta = 0 \right)\) in favour of \(H_1 \left( \beta \neq 0 \right)\) at significance level 0.05.

4.3 Coefficient of Determination (R-Squared)

from scipy import stats

# Simple linear regression
result = stats.linregress(df['bmi'], df['target'])

# Extract the results
r = result.rvalue
R2 = r**2

print(f'R² = {R2:.3f}')
## R² = 0.344

\(R^2\) = 34.4%. This indicates that 34.4% of the total variability observed in disease progression has been explained by the straight-line regression on BMI. The higher \(R^2\) is, the more closely the straight line adheres to the observed relationship between \(y\) and \(x\). In other words, \(R^2\) represents how much of the variability observed in \(y\) ignoring \(x\) (ie in the null regression) is explained by taking into account the change in \(x\) (ie by taking into account the straight-line regression).

\(R^2\) = 1 − (residual sum of squares / total sum of squares) where ‘total sum of squares’ is the residual variability around the null model and the ‘residual sum of squares’ is the residual variability around the straight-line regression.
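
As a check of this formula, here is a minimal sketch that calculates \(R^2\) directly from the two sums of squares, reusing the result object fitted above; it should reproduce the value of 0.344:

# Fitted values from the straight-line regression
y_hat = result.intercept + result.slope * df['bmi']
# Residual sum of squares: the variability left over after the straight-line regression
ss_res = ((df['target'] - y_hat)**2).sum()
# Total sum of squares: the variability around the null model (the sample mean)
ss_tot = ((df['target'] - df['target'].mean())**2).sum()

print(f'R² = {1 - ss_res / ss_tot:.3f}')
## R² = 0.344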

4.4 Confidence Interval

Calculate the 95% confidence interval for the slope and the intercept using the two-sided inverse Student’s t-distribution:

from scipy import stats

# Sample size
n = df.shape[0]
# Number of parameters
n_params = 2  # We have 2 parameters: α and β
# Degrees of freedom
dof = n - n_params  # Sample size - number of parameters
# Significance level
alpha = 0.05
# Percent-point function (aka quantile function) of the t-distribution
t_critical = stats.t.ppf(1 - (alpha / 2), dof)

print(f't-critical: {t_critical:.3f}')
## t-critical: 1.965

Confidence interval for the gradient:

# Margin of error
d = t_critical * se_β

print(f'Gradient, β = {β:.3f} ± {d:.3f} (95% CI); se = {se_β:.3f}')
## Gradient, β = 14.173 ± 1.834 (95% CI); se = 0.933
ci_upper = β + d
ci_lower = β - d

print(f'Gradient, β = {β:.3f}, 95% CI [{ci_lower:.3f}, {ci_upper:.3f}]')
## Gradient, β = 14.173, 95% CI [12.339, 16.007]

Confidence interval for the intercept:

# Margin of error
d = t_critical * se_α

print(f'Intercept, α = {α:.2f} ± {d:.1f} (95% CI); se = {se_α:.2f}')
## Intercept, α = -193.68 ± 45.1 (95% CI); se = 22.96
ci_upper = α + d
ci_lower = α - d

print(f'Intercept, α = {α:.2f}, 95% CI [{ci_lower:.1f}, {ci_upper:.1f}]')
## Intercept, α = -193.68, 95% CI [-238.8, -148.6]

4.5 Prediction Limits

An extension of simple linear regression, using NumPy:

import numpy as np

x = df['bmi']
y = df['target']

# Parameters from the fit of the 1D polynomial
params = np.polyfit(x, y, 1)

# Straight line regression
β = params[0]
α = params[1]

print(f'target = {α:.2f} {β:=+8.3f} × bmi')
## target = -193.68 + 14.173 × bmi
# Model using the fit parameters (ie using the coefficients)
y_model = np.polyval(params, x)
# Number of observations
n = y.size
# Number of parameters
m = params.size
# Degrees of freedom (number of observations - number of parameters)
dof = n - m
# Significance level
alpha = 0.05
# We're using a two-sided test
tails = 2
# Percent-point function (aka quantile function) of the t-distribution
t_critical = stats.t.ppf(1 - (alpha / tails), dof)
# Calculate the residuals (estimates of error in the data or the model)
resid = y - y_model
# Chi-squared; estimates error in data
chi2 = sum((resid / y_model)**2)
# Reduced chi-squared; measures goodness of fit
chi2_red = chi2 / dof
# Standard deviation of the error
std_err = np.sqrt(sum(resid**2) / dof)

Plot:

plt.scatter(df['bmi'], df['target'], c='k', s=6)
xlim = plt.xlim()
# Line of best fit
plt.plot(np.array(xlim), α + β * np.array(xlim), label='Linear Regression')
# Fit
x2 = np.linspace(xlim[0], xlim[1], 100)
y2 = np.polyval(params, x2)
# Confidence interval
ci = t_critical * std_err * np.sqrt(1 / n + (x2 - np.mean(x))**2 / np.sum((x - np.mean(x))**2))
plt.fill_between(
    x2, y2 + ci, y2 - ci, facecolor='#b9cfe7', zorder=0,
    label=r'95\% Confidence Interval'
)
# Prediction Interval
pi = t_critical * std_err * np.sqrt(1 + 1 / n + (x2 - np.mean(x))**2 / np.sum((x - np.mean(x))**2))
plt.plot(x2, y2 - pi, '--', color='0.5', label=r'95\% Prediction Limits')
plt.plot(x2, y2 + pi, '--', color='0.5')
# Title and labels
plt.title('Simple Linear Regression')
plt.xlabel('BMI [kg/m²]')
plt.ylabel('Target Variable (Disease Progression)')
# Finished
plt.legend(frameon=False, fontsize=8)
plt.xlim(xlim)
plt.show()

5 Making Inferences

When generalising your findings to the target population, take notice of whether you are interpolating (making conclusions about data within the sample domain) or extrapolating (making conclusions outside the sample domain). The domain is defined by the maximum and minimum values of the independent variable:

minimum = min(df['bmi'])
maximum = max(df['bmi'])

print(f'{minimum:.1f}, {maximum:.1f}')
## 18.4, 35.8

This means we can predict the disease progression for any patient with a BMI between about 18 and 36 kg/m². For predictions outside of this range we would need more data. With this in mind, let’s re-visit the code that will fit the model and give us the parameters but this time let’s use those to actually make a prediction about someone who has a BMI of, let’s say, 25:

# Fit the model
model = sm.OLS.from_formula('target ~ bmi', df)
results = model.fit()

# Use the model parameters to make a prediction
# (y = mx + c)
bmi = 25
prediction = results.params['bmi'] * bmi + results.params['Intercept']

print(round(prediction, 1))
## 160.6

We predict that someone with a BMI of 25 kg/m² will have a target of 160.6. This can also be done using the .predict() method which takes in a dictionary as its argument:

# Use the predict method to make a prediction
newdata = {'bmi': 25}
prediction = results.predict(newdata)

print(round(prediction[0], 1))
## 160.6

As expected, we get the same answer.

6 Normality Assumption

Simple linear regression gives us a model to explain the variability in a dataset, but it does not account for all the variability. The fact that not every single data point lies exactly on the line of best fit (indeed, it’s likely that none of the points lie on this line) is due to this unaccounted for error. This is known as the ‘measurement error’ or the ‘residual(s)’. One of the fundamental assumptions that simple linear regression makes is that these residuals are normally distributed, so it’s a good idea to check that this is at least reasonably valid. If it is, then it means that our results (parameters, predictions, confidence intervals, etc) are also reasonably valid. We can ‘eyeball’ this by:

  • Calculating the fitted values (using our model to predict our raw data)
  • Calculating the residuals (how wrong our predictions are)
  • Plotting a histogram of these residuals and checking that it’s roughly Gaussian in shape with a mean of approximately zero

# Calculate fitted values
fitted_values = results.predict(df)
# Calculate residuals
residuals = df['target'] - fitted_values

# Plot
plt.hist(residuals, color='gray', edgecolor='black')
# Title and labels
plt.title('Distribution of the Values of the Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')

There is no clear skewness, kurtosis, offset or presence of multiple ‘humps’, so we’re probably fine with our normality assumption.

7 Homoscedasticity Assumption

A second assumption that is necessary for our simple linear regression results to be valid is that the residuals have equal variation across all values of the predictor variable (in our example this is BMI). By plotting the fitted values against the residuals we can again ‘eyeball’ how reasonable this assumption is: we want the points to appear randomly scattered around a centre of y = 0.
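
One way to produce such a plot is with a minimal sketch like the following, reusing the fitted_values and residuals calculated in the previous section:

# Plot the residuals against the fitted values
plt.scatter(fitted_values, residuals, c='k', s=6)
# Reference line at y = 0
plt.axhline(0, c='lightblue', ls='--')
# Title and labels
plt.title('Residuals vs Fitted Values')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')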

This assumption looks reasonable to me.

⇦ Back