scikit-learn Toy Datasets in Python

For more information on this dataset, see the load_diabetes() function, which imports it.

The diabetes dataset contains measurements taken from 442 diabetic patients:
- age: age in years
- sex: male or female
- bmi: body mass index
- bp: average blood pressure
- s1: TC (total serum cholesterol)
- s2: LDL (low-density lipoproteins)
- s3: HDL (high-density lipoproteins)
- s4: TCH (total cholesterol / HDL)
- s5: LTG (possibly log of serum triglycerides level)
- s6: GLU (blood sugar level)

Each of the 10 feature variables has been mean centered and scaled by the standard deviation times the square root of the number of samples (i.e. the sum of squares of each column totals 1).
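The scaling described above can be checked numerically. The following is a minimal sketch, assuming scikit-learn and NumPy are installed and that the dataset is loaded with its default settings; the variable names and the tolerance of 1e-6 are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Load the features as a 442x10 NumPy array (default settings)
X = load_diabetes().data

# Mean centering: each column's mean should be (approximately) zero
column_means = X.mean(axis=0)

# Scaling: each column's sum of squares should be (approximately) one
column_sums_of_squares = (X ** 2).sum(axis=0)

# The tolerance is an assumption that allows for floating-point rounding
print(np.allclose(column_means, 0.0, atol=1e-6))
print(np.allclose(column_sums_of_squares, 1.0, atol=1e-6))
```

Because of this scaling, the feature values are small, unitless numbers rather than raw measurements such as years or blood pressure readings.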
The dataset can be loaded using load_diabetes() or load_diabetes(as_frame=True). Both return a ‘Bunch’ object, which can be indexed as if it were a dictionary. Its most important keys are:
Key | Value |
---|---|
DESCR | Description of the dataset |
feature_names | Names of the 10 features (the baseline measurements taken) |
data | The 442 baseline data points, formatted as a 442x10 NumPy array by default or as a 442x10 pandas data frame if as_frame=True was used |
target | The 442 one-year follow-up data points (the values for disease progression), formatted as a NumPy array by default or as a pandas series if as_frame=True was used |
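The keys in the table above can be inspected directly. This is a short sketch, assuming scikit-learn is installed (and pandas for the as_frame=True call); the variable names diabetes and diabetes_df are illustrative only.

```python
from sklearn.datasets import load_diabetes

# Default call: 'data' and 'target' are NumPy arrays
diabetes = load_diabetes()
print(diabetes['feature_names'])  # dictionary-style access
print(diabetes.data.shape)        # a Bunch also allows attribute-style access
print(diabetes['target'].shape)

# as_frame=True: 'data' is a pandas data frame and 'target' a pandas series
diabetes_df = load_diabetes(as_frame=True)
print(type(diabetes_df['data']).__name__)
print(type(diabetes_df['target']).__name__)
```

Printing diabetes['DESCR'] shows the full description of the dataset that ships with it.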
Example usage:
from sklearn import datasets
from matplotlib import pyplot as plt

# Load the dataset as pandas objects
diabetes = datasets.load_diabetes(as_frame=True)

# Don't plot the sex data (this leaves 9 features for the 3x3 grid)
features = [f for f in diabetes['feature_names'] if f != 'sex']

# Plot each remaining feature against the target
fig, axs = plt.subplots(3, 3)
fig.suptitle('Diabetes Dataset')
for i in range(3):
    for j in range(3):
        n = j + i * 3
        feature = features[n]
        axs[i, j].scatter(diabetes['data'][feature], diabetes['target'], s=1)
        axs[i, j].set_xlabel(feature)
        axs[i, j].set_ylabel('target')
plt.tight_layout()
plt.show()