⇦ Back

1 Python Packages

The code on this page uses the Statsmodels, Matplotlib, Seaborn, scikit-learn and NumPy packages. These can be installed from the terminal with the following commands:

# "python3.11" corresponds to the version of Python you have installed and are using
$ python3.11 -m pip install statsmodels
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install seaborn
$ python3.11 -m pip install sklearn
$ python3.11 -m pip install numpy

Once finished, import these packages into your Python script as follows:

from statsmodels import api as sm
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import model_selection
from sklearn import linear_model
import numpy as np

2 Example Data

This page will use the Longley Dataset from Statsmodels (see here for the documentation and the “longley” tab on this page for an example). Import it with the following:

# Load the data
dataset = sm.datasets.longley.load_pandas()

We will use the GNP and ARMED columns from the exog data frame as the exogenous variables (aka the independent variables, predictor variables or features). These contain 16 countries’ gross national product and armed forces size, respectively, from 1967. The endog variable (endogenous variable, aka the dependent variable, outcome variable or target) contains the total employment values from these countries:

# Separate out the features (exogenous variables)
cols = ['GNP', 'ARMED']
X = dataset['exog'][cols]
# Separate out the target (endogenous variable)
y = dataset['endog']

Divide everything by 1000 just to re-scale it all:

# Scale
X.loc[:, 'GNP'] = X['GNP'] / 1000
X.loc[:, 'ARMED'] = X['ARMED'] / 1000
y = y / 1000

Now visualise the raw data:

# Plot
sns.set_style('darkgrid')
sns.set_palette('deep')
plt.figure(figsize=(6.4, 9))
plt.suptitle('Longley Dataset')
# First sub-plot
ax1 = plt.subplot(211)
ax1.scatter(X['GNP'], y)
ax1.set_ylabel("Total Employment ('000s)")
ax1.set_xlabel("Gross National Product ('000s)")
# Second sub-plot
ax2 = plt.subplot(212)
ax2.scatter(X['ARMED'], y)
ax2.set_ylabel("Total Employment ('000s)")
ax2.set_xlabel("Size of Armed Forces ('000s)")
# Finish
plt.tight_layout()
plt.show()

3 Assumptions

Any type of statistical analysis will make assumptions. Multiple linear regression makes four and we need to check that they are met:

  1. Linearity can be assessed by plotting the predictor variables against the outcome variable and looking for linear relationships. This has been done above.
  2. Normality can be assessed after the model has been fitted by plotting a histogram of its residuals. This should have an approximately normal distribution.
  3. Homoscedasticity can also be checked for after fitting the model by plotting the residuals against the fitted values and confirming that there is no clear relationship
  4. Independence can be assessed by confirming that the predictor variables are not linearly related to each other (this would be referred to as multicollinearity). Check the correlations between pairs of variables in the data: correlations close to 1 or -1 indicate that they are too closely related and that one should be removed:
# Plot a heatmap of the correlations between pairs of predictor variables
corr_grid = X.corr()
sns.heatmap(
    corr_grid, xticklabels=corr_grid.columns, yticklabels=corr_grid.columns,
    vmin=-1, center=0, vmax=1, cmap='PuOr', annot=True
)
plt.show()

A correlation of 0.45 between the only pair of predictor variables is sufficiently low.

4 Multiple Linear Regression

Start by splitting the dataset up into a training set and a testing set. Use a 70/30 split:

# Split into training and testing sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, random_state=20230922, test_size=0.3
)

Create and fit the model using the training set:

# Create the model
regr = linear_model.LinearRegression()
# Train the model
regr.fit(X_train.values, y_train.values)

Get the equation of the model:

\(y = m_0 x_0 + m_1 x_1 + ... + m_n x_n + c\)

# Get the equation
print(f'y = {regr.coef_[0]:.2f} x₀ {regr.coef_[1]:+.2f} x₁ {regr.intercept_:+.2f}')
## y = 0.04 x₀ +0.13 x₁ +51.50

Use the model to make a prediction: what would the employment be in a country with a GNP of 300 and a armed forces size of 3?

# Use the model to make a prediction
country = [[300, 3]]
predict = regr.predict(country)

print(f'Predicted employment: {predict[0]:.2f}')
## Predicted employment: 62.42

Create a 3D plot to try visual the model:

# Plot
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(X['GNP'], X['ARMED'], y)
x_fitted = np.array([[250, 250], [550, 550]])
y_fitted = np.array([[1.2, 3.8], [1.2, 3.8]])
z_fitted = np.array([[250, 250, 550, 550], [1.2, 3.8, 1.2, 3.8]]).T
z_fitted = regr.predict(z_fitted).reshape((2, 2))
ax.plot_surface(x_fitted, y_fitted, z_fitted, alpha=0.6)
ax.set_title('Multiple Linear Regression')
ax.set_xlabel('Gross National Product')
ax.set_ylabel('Size of Armed Forces')
ax.set_zlabel('Total Employment')
plt.show()

5 Evaluate the Model

Visualise the difference between the predicted and actual values of the testing set:

# Test the model's predictions
y_pred = regr.predict(X_test.values)

# Plot
plt.figure(figsize=(6.4, 4.8))
ax = plt.axes()
ax.scatter(y_test, y_pred)
ax.plot([63, 71], [63, 71], c='lightgray')
ax.set_title('Actual vs Predicted Employment')
ax.set_ylabel('Predicted Employment')
ax.set_xlabel('Actual Employment')
ax.set_ylim(63, 71)
ax.set_xlim(63, 71)
ax.set_aspect('equal', 'box')
plt.show()

The coefficient of determination, \(R^2\), is the proportion of the variation in \(y\) that is explained by all the \(x\) variables together:

# Coefficient of determination, R²
u = ((y_test - y_pred) ** 2).sum()  # Residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()  # Total sum of squares (TSS)
r_squared = 1 - (u / v)

print(f'R² = {r_squared:.3f}')
## R² = 0.930

Instead of calculating it manually, we can use the .score() method:

# Coefficient of determination, R²
r_squared = regr.score(X_test.values, y_test.values)

print(f'R² = {r_squared:.3f}')
## R² = 0.930

In other words, 93.0% of the variation in total employment is explained by the variation in GNP and size of armed forces together.

⇦ Back