scikit-learn
Toy Datasets in Python:
The wine recognition dataset is loaded using load_wine(). This returns a ‘Bunch’ object which contains both the data and its metadata. By default the data is formatted as NumPy arrays but, by setting the as_frame parameter to True when loading the dataset, this can be changed to use Pandas data frames:
from sklearn import datasets
# Load the dataset
wine = datasets.load_wine(as_frame=True)
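To see what the ‘Bunch’ object actually holds, its keys can be listed directly. The following is a minimal sketch, assuming a scikit-learn version that supports as_frame (0.23 or later); the exact set of keys may differ between versions:

```python
from sklearn import datasets

# Load the dataset as Pandas objects
wine = datasets.load_wine(as_frame=True)

# A Bunch behaves like a dictionary, so its keys can be listed
print(sorted(wine.keys()))

# Each item can be accessed by key or as an attribute
print(type(wine['data']).__name__)
print(type(wine.target).__name__)

# The DESCR item holds a plain-text description of the dataset
print(wine['DESCR'][:300])
```

Accessing items by key (wine['data']) and by attribute (wine.data) is equivalent; the key style is used in the rest of this page.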
The data contains the results of chemical analyses of 178 different wines, i.e. there are 178 samples (instances) in the dataset. The wines came from 3 different cultivators in the same region of Italy, and the cultivator is the target (class) information. 13 measurements were taken during each analysis, so there are 13 features (attributes). When formatted as a data frame, the data therefore consists of 178 rows and 13 + 1 columns (13 features and 1 target). The feature and target data can be extracted separately, as a data frame and a series respectively, or together in one data frame:
# Extract the feature data only
features = wine['data']
# Extract the target data only
target = wine['target']
# Extract the feature and target data together
df = wine['frame']
print(df.head())
## alcohol malic_acid ash ... od280/od315_of_diluted_wines proline target
## 0 14.23 1.71 2.43 ... 3.92 1065.0 0
## 1 13.20 1.78 2.14 ... 3.40 1050.0 0
## 2 13.16 2.36 2.67 ... 3.17 1185.0 0
## 3 14.37 1.95 2.50 ... 3.45 1480.0 0
## 4 13.24 2.59 2.87 ... 2.93 735.0 0
##
## [5 rows x 14 columns]
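The row and column counts described above can be checked from the shape of each object. This is a small sketch that reloads the data so it can be run on its own; the variable names follow those used earlier:

```python
from sklearn import datasets

wine = datasets.load_wine(as_frame=True)
features = wine['data']
target = wine['target']
df = wine['frame']

# 178 samples and 13 features
print(features.shape)
# One target value per sample
print(target.shape)
# Features plus the target column
print(df.shape)

# The target is the last column of the combined data frame
print(df.columns[-1])
```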
The names of the first 13 columns are the feature names, and these are also available in a separate feature_names list:
print(wine['feature_names'])
## ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
The name of the 14th column is target, which indicates that it holds the target information, i.e. which cultivator the wine in question came from. These are simply the values 0, 1 and 2:
print(df['target'].unique())
## [0 1 2]
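Beyond listing the distinct target values, it can be useful to know how many wines belong to each class. A possible way to check this is with the Pandas value_counts() method, shown here as a self-contained sketch; the target_names item is assumed to be present in the ‘Bunch’ object, as it is in recent scikit-learn versions:

```python
from sklearn import datasets

wine = datasets.load_wine(as_frame=True)
df = wine['frame']

# Count the samples in each class, ordered by class value
counts = df['target'].value_counts().sort_index()
print(counts)

# The labels that scikit-learn associates with the values 0, 1 and 2
print(list(wine['target_names']))
```

The classes are not the same size, which is worth keeping in mind when comparing them.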
from matplotlib import pyplot as plt
import seaborn as sns

# Draw one box plot per feature, grouped by target class, on a 5 x 3 grid
fig, axs = plt.subplots(5, 3, figsize=(8, 10))
for i, ax in enumerate(fig.get_axes()):
    if i < 13:
        feature = wine['feature_names'][i]
        # whis=[0, 100] extends the whiskers to the minimum and maximum values
        sns.boxplot(data=df, x='target', y=feature, whis=[0, 100], ax=ax)
        ax.set_title(feature)
        ax.set_ylabel('')
        ax.set_xlabel('')
# Remove the two unused axes in the bottom row
fig.delaxes(axs[4, 2])
fig.delaxes(axs[4, 1])
plt.tight_layout()
plt.show()
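The box plots above use whis=[0, 100], so each whisker spans the full range of a feature within a class. The same information can be read off numerically. The following is a sketch of one way to do that with a Pandas groupby; the summary variable name is only illustrative:

```python
from sklearn import datasets

wine = datasets.load_wine(as_frame=True)
df = wine['frame']

# For every feature, compute the minimum, median and maximum within each class
summary = df.groupby('target')[wine['feature_names']].agg(['min', 'median', 'max'])

# The result has one row per class and three columns per feature;
# selecting a feature name shows the statistics for that feature only
print(summary['alcohol'])
```

The minimum and maximum correspond to the ends of the whiskers and the median to the line inside each box.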