The code on this page uses the Statsmodels, Matplotlib, Seaborn, scikit-learn and NumPy packages. These can be installed from the terminal with the following commands:
# "python3.11" corresponds to the version of Python you have installed and are using
$ python3.11 -m pip install statsmodels
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install seaborn
$ python3.11 -m pip install scikit-learn
$ python3.11 -m pip install numpy
Once finished, import these packages into your Python script as follows:
from statsmodels import api as sm
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import model_selection
from sklearn import linear_model
import numpy as np
This page will use the Longley dataset from Statsmodels (see the Statsmodels documentation, and the “longley” tab on this page, for more details). Import it with the following:
# Load the data
dataset = sm.datasets.longley.load_pandas()
We will use the GNP and ARMED columns from the exog data frame as the exogenous variables (aka the independent variables, predictor variables or features). These contain 16 years’ worth of gross national product and armed forces size figures, respectively, for the United States from 1947 to 1962. The endog variable (endogenous variable, aka the dependent variable, outcome variable or target) contains the total employment values for those same years:
# Separate out the features (exogenous variables)
cols = ['GNP', 'ARMED']
X = dataset['exog'][cols]
# Separate out the target (endogenous variable)
y = dataset['endog']
Divide everything by 1000 just to re-scale it all:
# Scale (take a copy first to avoid pandas's SettingWithCopyWarning)
X = X.copy()
X['GNP'] = X['GNP'] / 1000
X['ARMED'] = X['ARMED'] / 1000
y = y / 1000
Now visualise the raw data:
# Plot
sns.set_style('darkgrid')
sns.set_palette('deep')
plt.figure(figsize=(6.4, 9))
plt.suptitle('Longley Dataset')
# First sub-plot
ax1 = plt.subplot(211)
ax1.scatter(X['GNP'], y)
ax1.set_ylabel("Total Employment ('000s)")
ax1.set_xlabel("Gross National Product ('000s)")
# Second sub-plot
ax2 = plt.subplot(212)
ax2.scatter(X['ARMED'], y)
ax2.set_ylabel("Total Employment ('000s)")
ax2.set_xlabel("Size of Armed Forces ('000s)")
# Finish
plt.tight_layout()
plt.show()
Any type of statistical analysis will make assumptions, and multiple linear regression is no exception. One of its assumptions, that there is no multicollinearity, requires that the predictor variables are not strongly correlated with each other. Check this with a correlation heatmap:
# Plot a heatmap of the correlations between pairs of predictor variables
corr_grid = X.corr()
sns.heatmap(
corr_grid, xticklabels=corr_grid.columns, yticklabels=corr_grid.columns,
vmin=-1, center=0, vmax=1, cmap='PuOr', annot=True
)
plt.show()
A correlation of 0.45 between the only pair of predictor variables is sufficiently low.
Start by splitting the dataset up into a training set and a testing set. Use a 70/30 split:
# Split into training and testing sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(
X, y, random_state=20230922, test_size=0.3
)
Create and fit the model using the training set:
# Create the model
regr = linear_model.LinearRegression()
# Train the model
regr.fit(X_train.values, y_train.values)
Get the equation of the model:
\(y = m_0 x_0 + m_1 x_1 + ... + m_n x_n + c\)
# Get the equation
print(f'y = {regr.coef_[0]:.2f} x₀ {regr.coef_[1]:+.2f} x₁ {regr.intercept_:+.2f}')
## y = 0.04 x₀ +0.13 x₁ +51.50
Use the model to make a prediction: what would the total employment be if the GNP were 300 and the size of the armed forces were 3 (both in the re-scaled units)?
# Use the model to make a prediction
new_data = [[300, 3]]
predict = regr.predict(new_data)
print(f'Predicted employment: {predict[0]:.2f}')
## Predicted employment: 62.42
Create a 3D plot to visualise the model:
# Plot
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(X['GNP'], X['ARMED'], y)
# Corner points of a plane covering the range of the data
x_fitted = np.array([[250, 250], [550, 550]])
y_fitted = np.array([[1.2, 3.8], [1.2, 3.8]])
# Predict the employment at each corner to get the plane's height
z_fitted = np.array([[250, 250, 550, 550], [1.2, 3.8, 1.2, 3.8]]).T
z_fitted = regr.predict(z_fitted).reshape((2, 2))
ax.plot_surface(x_fitted, y_fitted, z_fitted, alpha=0.6)
ax.set_title('Multiple Linear Regression')
ax.set_xlabel("Gross National Product ('000s)")
ax.set_ylabel("Size of Armed Forces ('000s)")
ax.set_zlabel("Total Employment ('000s)")
plt.show()
Visualise the difference between the predicted and actual values of the testing set:
# Test the model's predictions
y_pred = regr.predict(X_test.values)
# Plot
plt.figure(figsize=(6.4, 4.8))
ax = plt.axes()
ax.scatter(y_test, y_pred)
ax.plot([63, 71], [63, 71], c='lightgray')
ax.set_title('Actual vs Predicted Employment')
ax.set_ylabel('Predicted Employment')
ax.set_xlabel('Actual Employment')
ax.set_ylim(63, 71)
ax.set_xlim(63, 71)
ax.set_aspect('equal', 'box')
plt.show()
The coefficient of determination, \(R^2\), is the proportion of the variation in \(y\) that is explained by all the \(x\) variables together:
# Coefficient of determination, R²
u = ((y_test - y_pred) ** 2).sum() # Residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum() # Total sum of squares (TSS)
r_squared = 1 - (u / v)
print(f'R² = {r_squared:.3f}')
## R² = 0.930
Instead of calculating it manually, we can use the .score() method:
# Coefficient of determination, R²
r_squared = regr.score(X_test.values, y_test.values)
print(f'R² = {r_squared:.3f}')
## R² = 0.930
In other words, 93.0% of the variation in total employment is explained by the variation in GNP and size of armed forces together.