scikit-learn
Toy Datasets in Python
For more info, see here:
https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-house-prices-dataset
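The Boston house prices dataset is loaded using the load_boston()
function. Note that load_boston() was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code below requires scikit-learn 1.1 or earlier: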
from sklearn import datasets
# Load the dataset
boston = datasets.load_boston()
This returns a ‘Bunch’ object with the following keys:
Key | Description |
---|---|
DESCR | Description of the dataset |
filename | Location of the CSV file being imported |
feature_names | Names of the 13 groups of data |
data | The 506 data points in each of the 13 groups of data, formatted as a 506x13 array |
target | The target data, namely “MEDV” (Median value of owner-occupied homes in $1000’s) |
# Show the dataset's keys
print(list(boston))
## ['data', 'target', 'feature_names', 'DESCR', 'filename']
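A Bunch is a dictionary subclass, so each key can also be read as an attribute. The sketch below illustrates this with a small stand-in object rather than the real dataset: the name `toy` and all of its values are made up for illustration, and only `sklearn.utils.Bunch` itself comes from scikit-learn.

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical stand-in for the object returned by load_boston();
# these values are invented and are NOT the real Boston data
toy = Bunch(
    data=np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    target=np.array([10.0, 20.0, 30.0]),
    feature_names=np.array(['A', 'B']),
)

# Key-style and attribute-style access return the same object
same_object = toy['data'] is toy.data
print(same_object)

# The shape is (number of data points, number of features)
print(toy.data.shape)

# Iterating over a Bunch yields its keys, just like a dictionary
print(list(toy))
```

Under this reading, `boston.data` and `boston['data']` would be interchangeable in the examples that follow.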
Here’s what all the keys contain:
# Description of the dataset
print(boston['DESCR'])
## .. _boston_dataset:
##
## Boston house prices dataset
## ---------------------------
##
## **Data Set Characteristics:**
##
## :Number of Instances: 506
##
## :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
##
## :Attribute Information (in order):
## - CRIM per capita crime rate by town
## - ZN proportion of residential land zoned for lots over 25,000 sq.ft.
## - INDUS proportion of non-retail business acres per town
## - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
## - NOX nitric oxides concentration (parts per 10 million)
## - RM average number of rooms per dwelling
## - AGE proportion of owner-occupied units built prior to 1940
## - DIS weighted distances to five Boston employment centres
## - RAD index of accessibility to radial highways
## - TAX full-value property-tax rate per $10,000
## - PTRATIO pupil-teacher ratio by town
## - B 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
## - LSTAT % lower status of the population
## - MEDV Median value of owner-occupied homes in $1000's
##
## :Missing Attribute Values: None
##
## :Creator: Harrison, D. and Rubinfeld, D.L.
##
## This is a copy of UCI ML housing dataset.
## https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
##
##
## This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
##
## The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
## prices and the demand for clean air', J. Environ. Economics & Management,
## vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
## ...', Wiley, 1980. N.B. Various transformations are used in the table on
## pages 244-261 of the latter.
##
## The Boston house-price data has been used in many machine learning papers that address regression
## problems.
##
## .. topic:: References
##
## - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
## - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
# Location of the CSV file being imported
print(boston['filename'])
## /usr/local/lib/python3.9/dist-packages/sklearn/datasets/data/boston_house_prices.csv
# Names of the 13 groups of data
print(boston['feature_names'])
## ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
## 'B' 'LSTAT']
# The 506 data points in each of the 13 groups of data, formatted as a 506x13 array
print(boston['data'])
## [[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
## [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
## [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
## ...
## [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
## [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
## [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
# The target data, namely "MEDV" (Median value of owner-occupied homes in $1000’s)
print(boston['target'][:20])
## [24. 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15. 18.9 21.7 20.4
## 18.2 19.9 23.1 17.5 20.2 18.2]
Any column of the data can be plotted against another column or against the target as follows:
import matplotlib.pyplot as plt
# Plot AGE (column 6 of the data) against the target
plt.scatter(boston['data'][:, 6], boston['target'])
plt.title('Boston house prices dataset')
x = boston['feature_names'][6]
plt.xlabel(x + ' (proportion of owner-occupied units built prior to 1940)')
plt.ylabel("MEDV (Median value of owner-occupied homes in $1000's)")
# Display the figure (needed when running as a plain script)
plt.show()
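The plot above selects its column by the hard-coded position 6. A minimal sketch of looking the position up by name instead is shown here; `column_index` is a hypothetical helper, not part of scikit-learn, and the feature names are copied from the output printed earlier.

```python
import numpy as np

# Feature names as printed by boston['feature_names']
feature_names = np.array([
    'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
    'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT',
])


def column_index(names, feature):
    """Return the column position of `feature` (hypothetical helper)."""
    matches = np.flatnonzero(names == feature)
    if matches.size == 0:
        raise KeyError(f'unknown feature: {feature}')
    return int(matches[0])


idx = column_index(feature_names, 'AGE')
print(idx)
```

With the dataset loaded, `boston['data'][:, column_index(boston['feature_names'], 'AGE')]` should select the same column as the hard-coded index.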
The data is much easier to work with if it is converted into a data frame:
import pandas as pd
# Extract the data
data = pd.DataFrame(boston['data'], columns=boston['feature_names'])
# Extract the target
target = pd.DataFrame(boston['target'], columns=['MEDV'])
# Combine into one dataset
df = pd.concat([target, data], axis='columns')
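# Plot RM against MEDV on a new figure
plt.figure()
plt.scatter(df['RM'], df['MEDV'])
plt.title('Boston house prices dataset')
plt.xlabel('RM (average number of rooms per dwelling)')
plt.ylabel("MEDV (Median value of owner-occupied homes in $1000's)")
# Display the figure (needed when running as a plain script)
plt.show()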
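Beyond plotting, a data frame allows rows to be filtered and columns to be compared by name. The sketch below shows two such operations on a tiny stand-in frame; the name `toy_df` and its four rows are invented for illustration and are not taken from the Boston data.

```python
import pandas as pd

# Hypothetical stand-in values (NOT the real Boston data), chosen so that
# RM rises and AGE falls in step with MEDV
toy_df = pd.DataFrame({
    'MEDV': [20.0, 25.0, 30.0, 35.0],
    'RM': [5.0, 6.0, 7.0, 8.0],
    'AGE': [90.0, 70.0, 50.0, 30.0],
})

# Keep only the rows with more than 6 rooms on average
large = toy_df[toy_df['RM'] > 6]
print(len(large))

# Pearson correlation of every other column with the target
corr = toy_df.corr()['MEDV'].drop('MEDV')
print(corr)
```

Applied to the real `df` built above, the same correlation call could help decide which columns are worth plotting against MEDV.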