scikit-learn
Toy Datasets in Python: The Breast Cancer Wisconsin (Diagnostic) dataset is available as a ‘Bunch’ object, which contains both the data itself and metadata that provides additional information. It can be loaded using the load_breast_cancer() function from scikit-learn’s datasets sub-module. Setting the as_frame parameter to True is recommended, as this causes the actual data (as opposed to the metadata) to be loaded as pandas objects (as opposed to NumPy arrays).
from sklearn import datasets
# Load the dataset
breast_cancer = datasets.load_breast_cancer(as_frame=True)
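A ‘Bunch’ behaves like a dictionary whose keys can also be read as attributes. The following is a minimal sketch of how its contents might be inspected; the variable names keys, same_object and description are illustrative and not part of the original walkthrough, and the exact set of keys may differ between scikit-learn versions.

```python
from sklearn import datasets

# Load the dataset with pandas containers
breast_cancer = datasets.load_breast_cancer(as_frame=True)

# A Bunch is a dictionary subclass, so its keys can be listed
keys = sorted(breast_cancer.keys())
print(keys)

# Key access and attribute access return the same object
same_object = breast_cancer['data'] is breast_cancer.data
print(same_object)

# The DESCR entry holds a plain-text description of the dataset
description = breast_cancer['DESCR']
print(description[:200])
```

Key access is used throughout the rest of this page, but attribute access (breast_cancer.data) should work equally well.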
The dataset contains 569 samples (instances), each with 30 features (independent variables) and one target (dependent variable).
So, when formatted as a data frame, the data consists of 569 rows and 30 + 1 columns (30 features and 1 target). The feature and target data can be extracted separately (the features as a data frame and the target as a series) or together in one data frame:
# Extract the feature data only
features = breast_cancer['data']
# Extract the target data only
target = breast_cancer['target']
# Extract the feature and target data together
df = breast_cancer['frame']
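As a quick sanity check, the dimensions described above can be confirmed from the shape of each extracted object. This sketch reloads the dataset so that it runs on its own:

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer(as_frame=True)
features = breast_cancer['data']
target = breast_cancer['target']
df = breast_cancer['frame']

# Rows are samples; columns are features (plus the target in the full frame)
print(features.shape)  # (569, 30)
print(target.shape)    # (569,)
print(df.shape)        # (569, 31)

# The target on its own is one-dimensional
print(type(target).__name__)  # Series
```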
The column names of the feature data are the same as the feature names, which are also available in a separate feature_names array:
print(breast_cancer['feature_names'])
## ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
## 'mean smoothness' 'mean compactness' 'mean concavity'
## 'mean concave points' 'mean symmetry' 'mean fractal dimension'
## 'radius error' 'texture error' 'perimeter error' 'area error'
## 'smoothness error' 'compactness error' 'concavity error'
## 'concave points error' 'symmetry error' 'fractal dimension error'
## 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
## 'worst smoothness' 'worst compactness' 'worst concavity'
## 'worst concave points' 'worst symmetry' 'worst fractal dimension']
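The claim that the column names match the feature_names array can be checked directly. A small sketch, in which feature_names and columns_match are illustrative names:

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer(as_frame=True)
features = breast_cancer['data']

# Compare the data frame's column labels with the metadata array
feature_names = list(breast_cancer['feature_names'])
columns_match = list(features.columns) == feature_names
print(columns_match)

# The names fall into three groups of ten: 'mean', 'error' and 'worst'
n_mean = sum(name.startswith('mean ') for name in feature_names)
n_error = sum(name.endswith(' error') for name in feature_names)
n_worst = sum(name.startswith('worst ') for name in feature_names)
print(n_mean, n_error, n_worst)
```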
The target names are malignant and benign, corresponding to the two disease states of the tumours. These are encoded as 0 and 1 respectively:
print(target.unique(), breast_cancer['target_names'])
## [0 1] ['malignant' 'benign']
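Beyond listing the unique codes, it can be useful to see how many samples fall into each class. The sketch below counts the codes and pairs each one with its name; counts is an illustrative variable name.

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer(as_frame=True)
target = breast_cancer['target']

# Count the samples per integer code, ordered by code
counts = target.value_counts().sort_index()

# Pair each code with its class name from the metadata
for code, n in counts.items():
    print(code, breast_cancer['target_names'][code], n)
```

The description bundled with the dataset (the DESCR entry) lists the class distribution as 212 malignant and 357 benign, so the classes are moderately imbalanced.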
The target data can be decoded using a lambda function:
# Replace the integer codes with the corresponding class names
df['target'] = df['target'].apply(lambda x: breast_cancer['target_names'][x])
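An alternative to the lambda function, offered here as a sketch rather than as the original method, is to build a code-to-name dictionary and use map() on a copy of the data frame. Working on a copy leaves the ‘Bunch’ untouched, so the snippet can be re-run safely; labels and decoded are illustrative names.

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer(as_frame=True)

# Work on a copy so the frame stored in the Bunch is not modified
df = breast_cancer['frame'].copy()

# Build a {code: name} lookup, i.e. {0: 'malignant', 1: 'benign'}
labels = dict(enumerate(breast_cancer['target_names']))

# Translate every integer code into its class name
decoded = df['target'].map(labels)
df['target'] = decoded

print(df['target'].tail())
```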
Here’s a preview of the data:
print(df.tail())
## mean radius mean texture mean perimeter mean area mean smoothness \
## 564 21.56 22.39 142.00 1479.0 0.11100
## 565 20.13 28.25 131.20 1261.0 0.09780
## 566 16.60 28.08 108.30 858.1 0.08455
## 567 20.60 29.33 140.10 1265.0 0.11780
## 568 7.76 24.54 47.92 181.0 0.05263
##
## mean compactness ... worst compactness worst concavity \
## 564 0.11590 ... 0.21130 0.4107
## 565 0.10340 ... 0.19220 0.3215
## 566 0.10230 ... 0.30940 0.3403
## 567 0.27700 ... 0.86810 0.9387
## 568 0.04362 ... 0.06444 0.0000
##
## worst concave points worst symmetry worst fractal dimension target
## 564 0.2216 0.2060 0.07115 malignant
## 565 0.1628 0.2572 0.06637 malignant
## 566 0.1418 0.2218 0.07820 malignant
## 567 0.2650 0.4087 0.12400 malignant
## 568 0.0000 0.2871 0.07039 benign
##
## [5 rows x 31 columns]
from matplotlib import pyplot as plt
from matplotlib import lines
import seaborn as sns
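# Plot mean smoothness for each disease state
ax = plt.axes()
# Box plot: whiskers span the full range (0th to 100th percentiles)
# and the mean is drawn as an additional line
sns.boxplot(
    data=df, x='target', y='mean smoothness', color='lightgrey',
    whis=[0, 100], showmeans=True, meanline=True,
    meanprops={'color': 'black'}, ax=ax
)
# Overlay the individual samples as points
sns.stripplot(
    data=df, x='target', y='mean smoothness',
    color='lightgrey', edgecolor='black', linewidth=1, ax=ax
)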
plt.title('Breast Cancer Wisconsin Dataset')
plt.ylabel('Mean Smoothness')
plt.xlabel('')
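The quantities drawn by the box plot (with whiskers set to the 0th and 100th percentiles, these are the minimum, median, mean and maximum) can also be computed numerically. This is a minimal sketch that reloads and decodes the data so that it is self-contained; summary and labels are illustrative names.

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer(as_frame=True)
df = breast_cancer['frame'].copy()

# Decode the target so that the groups carry readable names
labels = dict(enumerate(breast_cancer['target_names']))
df['target'] = df['target'].map(labels)

# Summarise mean smoothness separately for each disease state
summary = df.groupby('target')['mean smoothness'].agg(
    ['min', 'median', 'mean', 'max']
)
print(summary)
```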