The Breast Cancer Wisconsin (Diagnostic) dataset is available as a ‘Bunch’ object, which contains both the data itself and metadata describing it. It can be loaded using the load_breast_cancer() function from scikit-learn’s datasets sub-module. Setting the as_frame parameter to True is recommended, as this causes the data (as opposed to the metadata) to be loaded as pandas objects (a data frame for the features and a series for the target) instead of NumPy arrays.

from sklearn import datasets

# Load the dataset
breast_cancer = datasets.load_breast_cancer(as_frame=True)
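To see what was actually loaded, the contents of the Bunch can be listed. The following is a minimal sketch; the variable name `keys` is introduced here purely for illustration.

```python
from sklearn import datasets

# Load the dataset with the data as pandas objects
breast_cancer = datasets.load_breast_cancer(as_frame=True)

# A Bunch behaves like a dictionary, so its keys can be listed
keys = list(breast_cancer.keys())
print(keys)

# Values can be accessed by key or, equivalently, as attributes
print(breast_cancer['data'] is breast_cancer.data)
```

Key-style access (`breast_cancer['data']`) is used throughout this page, but attribute-style access returns the same objects.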

The dataset contains 569 samples (instances) with 30 features (independent variables) and one target (dependent variable) for each.

So, when formatted as a data frame, the data consists of 569 rows and 30 + 1 columns (30 features and 1 target). The feature and target data can be extracted separately (the features as a data frame and the target as a series) or together in one data frame:

# Extract the feature data only
features = breast_cancer['data']

# Extract the target data only
target = breast_cancer['target']

# Extract the feature and target data together
df = breast_cancer['frame']
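The dimensions quoted above can be confirmed by inspecting the shape of each extracted object. This is a small sketch that reloads the dataset so that it can be run on its own:

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer(as_frame=True)
features = breast_cancer['data']
target = breast_cancer['target']
df = breast_cancer['frame']

# Rows are samples; columns are features (plus the target in the full frame)
print(features.shape)  # (569, 30)
print(target.shape)    # (569,)
print(df.shape)        # (569, 31)

# The features are a data frame while the target is a series
print(type(features).__name__, type(target).__name__)
```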

The column names of the feature data are the same as the features’ names, which are also available in a separate feature_names array:

print(breast_cancer['feature_names'])
## ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
##  'mean smoothness' 'mean compactness' 'mean concavity'
##  'mean concave points' 'mean symmetry' 'mean fractal dimension'
##  'radius error' 'texture error' 'perimeter error' 'area error'
##  'smoothness error' 'compactness error' 'concavity error'
##  'concave points error' 'symmetry error' 'fractal dimension error'
##  'worst radius' 'worst texture' 'worst perimeter' 'worst area'
##  'worst smoothness' 'worst compactness' 'worst concavity'
##  'worst concave points' 'worst symmetry' 'worst fractal dimension']
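That the column names and the feature_names array agree can be checked directly. The sketch below reloads the dataset to stay self-contained; `same_names` is a name chosen here for illustration.

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer(as_frame=True)
features = breast_cancer['data']

# Compare the data frame's column labels with the metadata array, in order
same_names = list(features.columns) == list(breast_cancer['feature_names'])
print(same_names)

# The names fall into three groups of ten: 'mean', 'error' and 'worst'
print(len(breast_cancer['feature_names']))
```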

The target names are malignant and benign, corresponding to the two disease states of the tumours. These are encoded as 0 and 1 respectively:

print(target.unique(), breast_cancer['target_names'])
## [0 1] ['malignant' 'benign']
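It can be useful to know how the samples are split between the two classes. As a sketch, the encoded target values can be counted with pandas’ value_counts() method; `counts` is an illustrative variable name.

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer(as_frame=True)
target = breast_cancer['target']

# Count the samples per encoded class (0 = malignant, 1 = benign)
counts = target.value_counts().sort_index()
for code, n in counts.items():
    print(breast_cancer['target_names'][code], n)
```

The classes are not balanced: there are 212 malignant and 357 benign samples.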

The target data can be decoded using a lambda function:

# Decode the target values (0 -> 'malignant', 1 -> 'benign')
df['target'] = df['target'].apply(lambda x: breast_cancer['target_names'][x])
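Note that the line above overwrites the target column in place, so running it a second time would fail because the values are no longer integer codes. An alternative, shown here only as a sketch, is to decode a copy of the data frame using a dictionary and the map() method of a pandas series. The names `df_decoded` and `code_to_name` are hypothetical and are not used elsewhere on this page.

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer(as_frame=True)

# Work on a copy so the data frame stored in the Bunch is left unchanged
df_decoded = breast_cancer['frame'].copy()

# Build a lookup from integer code to class name
code_to_name = {
    code: str(name)
    for code, name in enumerate(breast_cancer['target_names'])
}

# Replace each code with its class name
df_decoded['target'] = df_decoded['target'].map(code_to_name)
print(df_decoded['target'].unique())
```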

Here’s a preview of the data:

print(df.tail())
##      mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
## 564        21.56         22.39          142.00     1479.0         0.11100    
## 565        20.13         28.25          131.20     1261.0         0.09780    
## 566        16.60         28.08          108.30      858.1         0.08455    
## 567        20.60         29.33          140.10     1265.0         0.11780    
## 568         7.76         24.54           47.92      181.0         0.05263    
## 
##      mean compactness  ...  worst compactness  worst concavity  \
## 564         0.11590    ...         0.21130             0.4107    
## 565         0.10340    ...         0.19220             0.3215    
## 566         0.10230    ...         0.30940             0.3403    
## 567         0.27700    ...         0.86810             0.9387    
## 568         0.04362    ...         0.06444             0.0000    
## 
##      worst concave points  worst symmetry  worst fractal dimension     target  
## 564          0.2216                0.2060         0.07115           malignant  
## 565          0.1628                0.2572         0.06637           malignant  
## 566          0.1418                0.2218         0.07820           malignant  
## 567          0.2650                0.4087         0.12400           malignant  
## 568          0.0000                0.2871         0.07039              benign  
## 
## [5 rows x 31 columns]

Example Usage

from matplotlib import pyplot as plt
from matplotlib import lines
import seaborn as sns

# Plot
ax = plt.axes()
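# Box plot spanning the full range of the data, with the mean shown as a line
sns.boxplot(
    data=df, x='target', y='mean smoothness', color='lightgrey', whis=[0, 100],
    showmeans=True, meanline=True, meanprops={'color': 'black'}
)
# Overlay the individual data points
sns.stripplot(
    data=df, x='target', y='mean smoothness',
    color='lightgrey', edgecolor='black', linewidth=1
)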
plt.title('Breast Cancer Wisconsin Dataset')
plt.ylabel('Mean Smoothness')
plt.xlabel('')