This page demonstrates how to perform simple linear regression using Ordinary Least Squares (OLS) with scikit-learn; see the scikit-learn documentation for the LinearRegression class for more details and a worked example.

1 Python Packages

The code on this page uses the Statsmodels, Matplotlib, Seaborn, NumPy and scikit-learn packages. These can be installed from the terminal with the following commands:

# "python3.11" corresponds to the version of Python you have installed and are using
$ python3.11 -m pip install statsmodels
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install seaborn
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scikit-learn

Once finished, import these packages into your Python script as follows:

from statsmodels import api as sm
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
from sklearn import linear_model
from sklearn import model_selection

2 Example Data

This page will use the World Copper Market 1951-1975 Dataset from Statsmodels; see the Statsmodels datasets documentation for more details.

# Load the data
dataset = sm.datasets.copper.load_pandas()
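
If you want a quick look at what was loaded, the data is available as a Pandas data frame and can be previewed in the usual way (a minimal sketch):

# Preview the first few rows of the data frame
print(dataset['data'].head())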

We will use the INCOMEINDEX column - an index of real per capita income - as the independent variable and the WORLDCONSUMPTION column - the world consumption of copper in millions of kilograms - as the dependent variable:

# Plot
sns.set_style('darkgrid')
sns.set_palette('deep')
sns.scatterplot(data=dataset['data'], x='INCOMEINDEX', y='WORLDCONSUMPTION')
plt.title('World Copper Market 1951-1975 Dataset')
plt.ylabel('World Consumption of Copper (×10⁶ kg)')
plt.xlabel('Index of Real per Capita Income (Base 1970)')
plt.show()

3 Example 1: Using All the Data

Separate out the feature (aka the independent or exogenous variable) and the target (aka the dependent or endogenous variable):

# Separate out the feature (exogenous variable)
X = dataset['exog']['INCOMEINDEX']
# Reshape from a 1D array into a 2D array
X = np.array(X).reshape(-1, 1)
# Separate out the target (endogenous variable)
y = dataset['endog']
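
As a sanity check, it can be worth confirming that the feature is two-dimensional and the target is one-dimensional, which is the shape scikit-learn expects (the dataset has 25 observations, one per year from 1951 to 1975):

# Check the shapes: X should be 2D and y should be 1D
print(X.shape)  # (25, 1)
print(y.shape)  # (25,)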

Create and train the model:

# Create the model
regr = linear_model.LinearRegression()
# Train the model
regr.fit(X, y)
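
Once fitted, the slope and intercept of the regression line can be read off the model's coef_ and intercept_ attributes:

# Inspect the fitted parameters of the line y = mx + c
print(regr.coef_[0])    # Slope, m
print(regr.intercept_)  # Intercept, c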

Use the model to make predictions. A straight line is fully determined by two points, so predicting at the minimum and maximum values of the feature is enough to plot the fitted line:

# Predict at the two extremes of the feature (two points define a line)
X_fitted = np.array([np.min(X), np.max(X)]).reshape(-1, 1)
y_fitted = regr.predict(X_fitted)

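The model's score() method returns the coefficient of determination (R²), which gives a quick indication of the goodness-of-fit. Note that, in this example, it is being evaluated on the same data the model was trained on:

# Coefficient of determination (R²) on the data used for training
print(regr.score(X, y))
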
Now let’s visualise it:

# Plot
plt.scatter(X, y, ec='w')
label = f'y = {np.round(regr.coef_[0], 2)}x + {np.round(regr.intercept_, 2)}'
plt.plot(X_fitted, y_fitted, c='gray', label=label)
plt.title('World Copper Market 1951-1975 Dataset')
plt.ylabel('World Consumption of Copper (×10⁶ kg)')
plt.xlabel('Index of Real per Capita Income (Base 1970)')
plt.legend(frameon=False)
plt.show()

4 Example 2: Using Training and Testing Sets

Split the dataset up so that 70% is used for training and 30% is used for testing:

# Split into training and testing sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, random_state=20230922, test_size=0.3
)
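
It can be worth confirming the sizes of the two sets. With 25 observations and test_size=0.3, scikit-learn rounds the testing set size up, giving 8 testing and 17 training samples:

# Confirm the sizes of the training and testing sets
print(len(X_train), len(X_test))  # 17 8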

Create and fit a new model:

# Create the model
regr = linear_model.LinearRegression()
# Train the model
regr.fit(X_train, y_train)

Use it to make predictions:

# Use the model to make a prediction
y_pred = regr.predict(X_test)
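
Before plotting, the quality of these predictions can be quantified against the true values in the testing set, for example with the mean squared error and R² from scikit-learn's metrics module:

from sklearn import metrics

# Evaluate the predictions against the true testing-set values
print(metrics.mean_squared_error(y_test, y_pred))
print(metrics.r2_score(y_test, y_pred))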

Visualise:

# Plot
label = f'y = {np.round(regr.coef_[0], 2)}x + {np.round(regr.intercept_, 2)}'
plt.plot(X_test, y_pred, c='gray', label=label)
plt.scatter(X_train, y_train, ec='w', label='Training set')
plt.scatter(X_test, y_test, ec='w', label='Testing set')
plt.scatter(X_test, y_pred, ec='w', label='Predicted values')
plt.title('World Copper Market 1951-1975 Dataset')
plt.ylabel('World Consumption of Copper (×10⁶ kg)')
plt.xlabel('Index of Real per Capita Income (Base 1970)')
plt.legend(frameon=False)
plt.show()

Notice that each green dot (a predicted value) lies exactly on the grey regression line, directly above or below the orange dot (the true testing-set value) that shares its x-value.
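
This is expected: a prediction from a simple linear regression model is just the fitted line evaluated at that point's x-value, which can be confirmed directly (a minimal sketch):

# Each prediction equals slope × x + intercept, ie it lies on the fitted line
print(np.allclose(y_pred, regr.coef_[0] * X_test.ravel() + regr.intercept_))  # True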
