scikit-learn Toy Datasets in Python: Boston house prices dataset

For more information, see the scikit-learn documentation:
https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-house-prices-dataset

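The Boston house prices dataset is loaded using the `load_boston()` function. This function was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below needs an older version of the library: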

from sklearn import datasets

# Load the dataset
boston = datasets.load_boston()
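On scikit-learn versions that no longer ship `load_boston()`, the raw file can be parsed directly. The sketch below follows the approach suggested in scikit-learn's own deprecation notice. It assumes the StatLib copy is still served at the URL shown, and `load_boston_from_statlib` is a hypothetical helper name, not a scikit-learn function.

```python
import numpy as np
import pandas as pd

# Assumption: the StatLib copy of the data is still served at this address
STATLIB_URL = "http://lib.stat.cmu.edu/datasets/boston"


def load_boston_from_statlib(source=STATLIB_URL):
    """Return (data, target) arrays parsed from the raw StatLib file.

    `source` may be a URL, a local path or an open file-like object.
    """
    # The file starts with 22 lines of description; each record then spans two lines
    raw_df = pd.read_csv(source, sep=r"\s+", skiprows=22, header=None)
    values = raw_df.values
    # Even rows hold the first 11 features, odd rows hold the last 2 plus MEDV
    data = np.hstack([values[::2, :], values[1::2, :2]])
    target = values[1::2, 2]
    return data, target


# Usage (needs network access):
# data, target = load_boston_from_statlib()
# print(data.shape, target.shape)
```

This returns plain NumPy arrays rather than a Bunch, so the feature names have to be supplied separately.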

`load_boston()` returns a ‘Bunch’ object, a dictionary-like container, with the following keys:

Key	Description
`DESCR`	Description of the dataset
`filename`	Path of the CSV file the data is read from
`feature_names`	Names of the 13 features
`data`	The feature values as a 506x13 array (506 samples, 13 features)
`target`	The target values, namely “MEDV” (median value of owner-occupied homes in $1000’s)

# Show the dataset's keys
print(list(boston))

## ['data', 'target', 'feature_names', 'DESCR', 'filename']
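A Bunch accepts both dictionary-style and attribute-style access. The minimal sketch below shows this on a stand-in object built with `sklearn.utils.Bunch`. The stand-in (`toy`) is an assumption made for illustration: it holds only the first three rows and first three features that appear in the output printed later in this article, not the full dataset.

```python
import numpy as np
from sklearn.utils import Bunch

# Stand-in Bunch: first three rows and first three features only;
# the real object holds 506 rows and 13 features
toy = Bunch(
    data=np.array([[0.00632, 18.0, 2.31],
                   [0.02731, 0.0, 7.07],
                   [0.02729, 0.0, 7.07]]),
    target=np.array([24.0, 21.6, 34.7]),
    feature_names=np.array(['CRIM', 'ZN', 'INDUS']),
)

# Keys can be listed like those of a dictionary
print(list(toy))

# Key access and attribute access return the same object
print(toy['target'])
print(toy.target)
print(toy.data is toy['data'])
```

The same holds for the object returned by the loader, so `boston.data` and `boston['data']` are interchangeable.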

Each key contains the following:

# Description of the dataset
print(boston['DESCR'])

## .. _boston_dataset:
## 
## Boston house prices dataset
## ---------------------------
## 
## **Data Set Characteristics:**  
## 
##     :Number of Instances: 506 
## 
##     :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
## 
##     :Attribute Information (in order):
##         - CRIM     per capita crime rate by town
##         - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
##         - INDUS    proportion of non-retail business acres per town
##         - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
##         - NOX      nitric oxides concentration (parts per 10 million)
##         - RM       average number of rooms per dwelling
##         - AGE      proportion of owner-occupied units built prior to 1940
##         - DIS      weighted distances to five Boston employment centres
##         - RAD      index of accessibility to radial highways
##         - TAX      full-value property-tax rate per $10,000
##         - PTRATIO  pupil-teacher ratio by town
##         - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
##         - LSTAT    % lower status of the population
##         - MEDV     Median value of owner-occupied homes in $1000's
## 
##     :Missing Attribute Values: None
## 
##     :Creator: Harrison, D. and Rubinfeld, D.L.
## 
## This is a copy of UCI ML housing dataset.
## https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
## 
## 
## This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
## 
## The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
## prices and the demand for clean air', J. Environ. Economics & Management,
## vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
## ...', Wiley, 1980.   N.B. Various transformations are used in the table on
## pages 244-261 of the latter.
## 
## The Boston house-price data has been used in many machine learning papers that address regression
## problems.   
##      
## .. topic:: References
## 
##    - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
##    - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

# Location of the CSV file being imported
print(boston['filename'])

## /usr/local/lib/python3.9/dist-packages/sklearn/datasets/data/boston_house_prices.csv

# Names of the 13 features
print(boston['feature_names'])

## ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
##  'B' 'LSTAT']

# The feature values as a 506x13 array (506 samples, 13 features)
print(boston['data'])

## [[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
##  [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
##  [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
##  ...
##  [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
##  [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
##  [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
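NumPy prints this array in scientific notation because the columns differ widely in magnitude. A small sketch of how the display can be switched to fixed-point notation using `np.printoptions`; the `sample` array is a stand-in holding only the first three rows and first three columns (CRIM, ZN, INDUS) from the output above.

```python
import numpy as np

# Stand-in: first three rows and first three columns from the output above
sample = np.array([[6.3200e-03, 1.8000e+01, 2.3100e+00],
                   [2.7310e-02, 0.0000e+00, 7.0700e+00],
                   [2.7290e-02, 0.0000e+00, 7.0700e+00]])

# Temporarily switch off scientific notation while formatting
with np.printoptions(suppress=True, precision=4):
    formatted = str(sample)

print(formatted)
```

The context manager restores the previous print settings on exit, so other output is unaffected.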

# The first 20 values of the target, namely "MEDV" (median value of owner-occupied homes in $1000’s)
print(boston['target'][:20])

## [24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
##  18.2 19.9 23.1 17.5 20.2 18.2]

Any feature can be plotted against another feature or against the target. The example below plots AGE (column 6 of the data) against MEDV:

import matplotlib.pyplot as plt

# Scatter plot of AGE (column 6) against the target
plt.scatter(boston['data'][:, 6], boston['target'])
plt.title('Boston house prices dataset')
# Look up the feature name so the axis label matches the plotted column
x = boston['feature_names'][6]
plt.xlabel(x + ' (proportion of owner-occupied units built prior to 1940)')
plt.ylabel("MEDV (Median value of owner-occupied homes in $1000's)")
plt.show()
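The plot above selects its column by position. A minimal sketch of a helper that looks columns up by name and plots one feature against another instead; `scatter_features` is a hypothetical helper name, not part of scikit-learn or matplotlib.

```python
import matplotlib.pyplot as plt


def scatter_features(data, feature_names, x_name, y_name):
    """Scatter one feature column against another, selected by name."""
    names = list(feature_names)
    # Translate the feature names into column positions
    x_col = names.index(x_name)
    y_col = names.index(y_name)
    fig, ax = plt.subplots()
    ax.scatter(data[:, x_col], data[:, y_col])
    ax.set_title('Boston house prices dataset')
    ax.set_xlabel(x_name)
    ax.set_ylabel(y_name)
    return ax


# Usage, assuming `boston` has been loaded as shown earlier:
# scatter_features(boston['data'], boston['feature_names'], 'RM', 'LSTAT')
# plt.show()
```

Looking the column up by name avoids hard-coding indices such as 6, at the cost of a `ValueError` if the name is misspelled.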

The data is much easier to work with if it is converted into a pandas data frame, because columns can then be selected by name rather than by position:

import pandas as pd

# Extract the data
data = pd.DataFrame(boston['data'], columns=boston['feature_names'])
# Extract the target
target = pd.DataFrame(boston['target'], columns=['MEDV'])
# Combine into one dataset
df = pd.concat([target, data], axis='columns')

# Scatter plot of RM against MEDV, selecting both columns by name
plt.scatter(df['RM'], df['MEDV'])
plt.title('Boston house prices dataset')
plt.xlabel('RM (average number of rooms per dwelling)')
plt.ylabel("MEDV (Median value of owner-occupied homes in $1000's)")
plt.show()
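To illustrate what the data frame makes easier, here is a short sketch of name-based selection, row filtering and a correlation ranking. Two things are assumptions made for illustration: `toy_df` is a stand-in holding only the first three rows and three features printed earlier, and `rank_by_correlation` is a hypothetical helper, not a pandas function.

```python
import pandas as pd


def rank_by_correlation(df, target_col='MEDV'):
    """Return features ordered by absolute Pearson correlation with the target."""
    correlations = df.corr()[target_col].drop(target_col)
    return correlations.abs().sort_values(ascending=False)


# Stand-in frame: first three rows of MEDV, CRIM, ZN and INDUS only
toy_df = pd.DataFrame({
    'MEDV': [24.0, 21.6, 34.7],
    'CRIM': [0.00632, 0.02731, 0.02729],
    'ZN': [18.0, 0.0, 0.0],
    'INDUS': [2.31, 7.07, 7.07],
})

# Select columns by name
print(toy_df[['MEDV', 'CRIM']])
# Keep only the rows that satisfy a condition
print(toy_df[toy_df['ZN'] > 0])
# Rank features by correlation with the target
print(rank_by_correlation(toy_df))
```

Three rows are far too few for the correlations to mean anything; applied to the full `df` built above, the same call gives a quick indication of which features are worth plotting against MEDV.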
