This page follows on from Bar Plots with One Group of Data.
This example will once again use the iris dataset from the scikit-learn package (more info here) except this time all three groups will be used:
from sklearn.datasets import load_iris
# Load the dataset
iris = load_iris()
# Print the first 10 rows of data
print(iris['data'][:10])
## [[5.1 3.5 1.4 0.2]
## [4.9 3. 1.4 0.2]
## [4.7 3.2 1.3 0.2]
## [4.6 3.1 1.5 0.2]
## [5. 3.6 1.4 0.2]
## [5.4 3.9 1.7 0.4]
## [4.6 3.4 1.4 0.3]
## [5. 3.4 1.5 0.2]
## [4.4 2.9 1.4 0.2]
## [4.9 3.1 1.5 0.1]]
The dataset is a little bit confusing in its current format, so let’s convert it into a Pandas data frame:
import pandas as pd
# Convert the array to a data frame
iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
# Add the species data as a column to the data frame
iris_df['Species'] = iris['target']
# Print the first 10 rows of data
print(iris_df.iloc[:10,])
## sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Species
## 0 5.1 3.5 1.4 0.2 0
## 1 4.9 3.0 1.4 0.2 0
## 2 4.7 3.2 1.3 0.2 0
## 3 4.6 3.1 1.5 0.2 0
## 4 5.0 3.6 1.4 0.2 0
## 5 5.4 3.9 1.7 0.4 0
## 6 4.6 3.4 1.4 0.3 0
## 7 5.0 3.4 1.5 0.2 0
## 8 4.4 2.9 1.4 0.2 0
## 9 4.9 3.1 1.5 0.1 0
This dataset contains information about 150 different iris flowers of three types: ‘setosa’, ‘versicolor’ and ‘virginica’. The first four columns of the data frame contain the measurements that were made on each flower: sepals length, sepal width, petal length and petal width. The fifth column, ‘Species’ contains either 0, 1 or 2 to show which of the three species of iris the flower was. There are 150 rows in the data frame: one for each flower that was examined and 50 of each species.
Similar to the previous examples, the unique()
function from Numpy can be used to extract the unique values of the sepal widths and the return_counts
option can be used to count the number of occurrences of each:
import matplotlib.pyplot as plt
import numpy as np
# Make figures A6
A = 6
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
# Use Latex
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')
# Create a data frame for each species of iris
setosa = iris_df[iris_df['Species'] == 0]
versicolor = iris_df[iris_df['Species'] == 1]
virginica = iris_df[iris_df['Species'] == 2]
# Plot
width = 0.1
# Count the number of occurrences of each sepal width for each species
uniq, cnts = np.unique(setosa['sepal width (cm)'], return_counts=1)
plt.bar(uniq, cnts, width, color='C0')
uniq, cnts = np.unique(versicolor['sepal width (cm)'], return_counts=1)
plt.bar(uniq, cnts, width, color='C2')
uniq, cnts = np.unique(virginica['sepal width (cm)'], return_counts=1)
plt.bar(uniq, cnts, width, color='C3')
# Set labels
plt.title('The Widths of the Sepals of 150 Iris Flowers')
plt.xlabel('Width (cm)')
plt.ylabel('Count')
plt.show()
This has worked but there are two immediate problems:
Let’s fix the first of those problems first:
By reducing the width of each bar to a third of its original width and offsetting each group of data, we can separate the bars as follows:
width = 0.1 / 3
# Count the number of occurrences of each sepal width for each species
uniq, cnts = np.unique(setosa['sepal width (cm)'], return_counts=1)
plt.bar(uniq, cnts, width, color='C0')
uniq, cnts = np.unique(versicolor['sepal width (cm)'], return_counts=1)
plt.bar(uniq + width, cnts, width, color='C2')
uniq, cnts = np.unique(virginica['sepal width (cm)'], return_counts=1)
plt.bar(uniq + 2 * width, cnts, width, color='C3')
# Set labels
plt.title('The Widths of the Sepals of 150 Iris Flowers')
plt.xlabel('Width (cm)')
plt.ylabel('Count')
plt.show()
The easiest way to indicate which group is which is to use a legend:
width = 0.1 / 3
# Count the number of occurrences of each sepal width for each species
uniq, cnts = np.unique(setosa['sepal width (cm)'], return_counts=1)
bar0 = plt.bar(uniq, cnts, width, color='C0')
uniq, cnts = np.unique(versicolor['sepal width (cm)'], return_counts=1)
bar1 = plt.bar(uniq + width, cnts, width, color='C2')
uniq, cnts = np.unique(virginica['sepal width (cm)'], return_counts=1)
bar2 = plt.bar(uniq + 2 * width, cnts, width, color='C3')
# Set labels
plt.title('The Widths of the Sepals of 150 Iris Flowers')
plt.xlabel('Width (cm)')
plt.ylabel('Count')
# Legend
plt.legend(
(bar0[0], bar1[0], bar2[0]), ('Setosa', 'Versicolor', 'Virginica')
)
plt.show()
Instead of plotting the width of every single sepal from every single flower, we can group all the sepals in one group together and only plot the means:
width = 0.5
# Plot the average sepal width for each species
bar0 = plt.bar(0, setosa['sepal width (cm)'].mean(), width, color='C0')
bar1 = plt.bar(1, versicolor['sepal width (cm)'].mean(), width, color='C2')
bar2 = plt.bar(2, virginica['sepal width (cm)'].mean(), width, color='C3')
# Set labels
plt.title('Average Sepal Width for Each Species of Iris')
plt.xlabel('')
plt.ylabel('Average Width (cm)')
# Legend
plt.gca().legend(
(bar0[0], bar1[0], bar2[0]), ('Setosa', 'Versicolor', 'Virginica'),
fontsize=8, loc='center left', bbox_to_anchor=(1, 0.5)
)
plt.subplots_adjust(right=0.82)
plt.show()
Note that, in order to fit the legend in, we had to adjust the size of the plot area with plt.subplots_adjust(right=0.82)
An important thing to realise at this point is that the first examples used ‘infinite bins’: each sepal width had a bar corresponding to it regardless of how many values there were. This previous example, however, is the opposite in that it only has one bin for each iris type. Essentially, all the bins have been compressed into one for each group. Later we will create plots that have a custom number of bins, between one and infinity.
Now that the x-axis corresponds to the groups as opposed to values, it looks strange to have the scale on that axis be a continuous variable. Replace it with the group names:
import matplotlib.ticker as ticker
# Plot
width = 0.5
# Plot the average sepal width for each species, including standard deviation
data0 = setosa['sepal width (cm)']
plt.bar(0, data0.mean(), width, color='C0')
data1 = versicolor['sepal width (cm)']
plt.bar(1, data1.mean(), width, color='C2')
data2 = virginica['sepal width (cm)']
plt.bar(2, data2.mean(), width, color='C3')
# Set labels
plt.title('Average Sepal Width for Each Species of Iris')
plt.xlabel('Species')
plt.ylabel('Average Width (cm)')
# x-Axis details
plt.xticks([0, 1, 2], ['Setosa', 'Versicolor', 'Virginica'])
plt.gca().tick_params(axis='x', which='major', length=0)
plt.gca().tick_params(axis='x', which='minor', length=4)
xlocs = np.arange(4) - 0.5
plt.gca().xaxis.set_minor_locator(ticker.FixedLocator(xlocs))
plt.show()
The fact that each bar represents a number of values means that we can calculate a standard deviation for each, and represent that uncertainty on the graph as error bars:
# Plot
width = 0.5
# Plot the average sepal width for each species, including standard deviation
data0 = setosa['sepal width (cm)']
plt.bar([0], data0.mean(), width, color='C0', yerr=data0.std())
data1 = versicolor['sepal width (cm)']
plt.bar([1], data1.mean(), width, color='C2', yerr=data1.std())
data2 = virginica['sepal width (cm)']
plt.bar([2], data2.mean(), width, color='C3', yerr=data2.std())
# Set labels
plt.title('Average Sepal Width for Each Species of Iris')
plt.xlabel('Species')
plt.ylabel('Average Width (cm)')
# x-Axis details
plt.xticks([0, 1, 2], ['Setosa', 'Versicolor', 'Virginica'])
plt.gca().tick_params(axis='x', which='major', length=0)
plt.gca().tick_params(axis='x', which='minor', length=4)
xlocs = np.arange(4) - 0.5
plt.gca().xaxis.set_minor_locator(ticker.FixedLocator(xlocs))
plt.show()
Instead of error bars, the spread of the data can be communicated by showing the data points themselves on top of the bars:
# Plot
width = 0.5
# Separate out the three datasets
data0 = setosa['sepal width (cm)']
data1 = versicolor['sepal width (cm)']
data2 = virginica['sepal width (cm)']
# Plot the individual points
plt.scatter([0] * len(data0), data0, color='k')
plt.scatter([1] * len(data1), data1, color='k')
plt.scatter([2] * len(data2), data2, color='k')
# Plot the average sepal width for each species
plt.bar(0, data0.mean(), width, color='C0', zorder=0)
plt.bar(1, data1.mean(), width, color='C2', zorder=0)
plt.bar(2, data2.mean(), width, color='C3', zorder=0)
# Set labels
plt.title('Average Sepal Width for Each Species of Iris')
plt.xlabel('Species')
plt.ylabel('Average Width (cm)')
# x-Axis details
plt.xticks([0, 1, 2], ['Setosa', 'Versicolor', 'Virginica'])
plt.gca().tick_params(axis='x', which='major', length=0)
plt.gca().tick_params(axis='x', which='minor', length=4)
xlocs = np.arange(4) - 0.5
plt.gca().xaxis.set_minor_locator(ticker.FixedLocator(xlocs))
plt.show()
Let’s show the height of each bar as a number on the plot itself:
# Plot
width = 0.5
# Separate out the three datasets
data0 = setosa['sepal width (cm)']
data1 = versicolor['sepal width (cm)']
data2 = virginica['sepal width (cm)']
# Plot the average sepal width for each species
b0 = plt.bar(0, data0.mean(), width, color='C0', zorder=0)
b1 = plt.bar(1, data1.mean(), width, color='C2', zorder=0)
b2 = plt.bar(2, data2.mean(), width, color='C3', zorder=0)
# Annotate the values above each bar
for bar in [b0, b1, b2]:
height = bar.patches[0].get_height()
plt.text(
bar.patches[0].get_x() + bar.patches[0].get_width() / 2, 1.05 * height,
f'{height:4.2f}', ha='center', va='bottom'
)
# Increase the height of the plot by 15% to accommodate the labels
bottom, top = plt.ylim()
plt.ylim(bottom, top * 1.15)
# Set labels
plt.title('Average Sepal Width for Each Species of Iris')
plt.ylabel('Average Width (cm)')
# x-Axis details
plt.xticks(np.arange(3), ['Setosa', 'Versicolor', 'Virginica'])
plt.gca().tick_params(axis='x', which='major', length=0)
plt.gca().tick_params(axis='x', which='minor', length=4)
xlocs = np.arange(3 + 1) - 0.5
plt.gca().xaxis.set_minor_locator(ticker.FixedLocator(xlocs))
plt.show()
This time we will stratify the data into long (6.5+ cm), medium (5.5 to 6.5 cm) and short (0 to 5.5 cm) sepals and plot the average sepal width for each group for each stratification:
# Separate out the three datasets
discriminator = 'sepal length (cm)'
data0 = [
setosa[setosa[discriminator] < 5.5]['sepal width (cm)'].mean(),
setosa[(setosa[discriminator] >= 5.5) & (setosa[discriminator] < 6.5)]['sepal width (cm)'].mean(),
setosa[setosa[discriminator] >= 6.5]['sepal width (cm)'].mean()
]
data1 = [
versicolor[versicolor[discriminator] < 5.5]['sepal width (cm)'].mean(),
versicolor[(versicolor[discriminator] >= 5.5) & (versicolor[discriminator] < 6.5)]['sepal width (cm)'].mean(),
versicolor[versicolor[discriminator] >= 6.5]['sepal width (cm)'].mean()
]
data2 = [
virginica[virginica[discriminator] < 5.5]['sepal width (cm)'].mean(),
virginica[(virginica[discriminator] >= 5.5) & (virginica[discriminator] < 6.5)]['sepal width (cm)'].mean(),
virginica[virginica[discriminator] >= 6.5]['sepal width (cm)'].mean()
]
# Plot
width = 0.3
bar0 = plt.bar(np.arange(len(data0)) - width, data0, width, color='C0')
bar1 = plt.bar(np.arange(len(data1)), data1, width, color='C2')
bar2 = plt.bar(np.arange(len(data2)) + width, data2, width, color='C3')
# Set labels
plt.title('Average Sepal Width Given the Sepal Length')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Average Sepal Width (cm)')
# x-Axis details
plt.xticks(np.arange(3), ['0 to 5.5', '5.5 to 6.5', '6.5+'])
plt.gca().tick_params(axis='x', which='major', length=0)
plt.gca().tick_params(axis='x', which='minor', length=4)
xlocs = np.arange(len(data1) + 1) - 0.5
plt.gca().xaxis.set_minor_locator(ticker.FixedLocator(xlocs))
plt.xlim(min(xlocs), max(xlocs))
# Legend
plt.gca().legend(
(bar0[0], bar1[0], bar2[0]), ('Setosa', 'Versicolor', 'Virginica'),
fontsize=8, loc='center left', bbox_to_anchor=(1, 0.5)
)
plt.show()
As before, let’s add the height above each bar:
# Separate out the three datasets
discriminator = 'sepal length (cm)'
data0 = [
setosa[setosa[discriminator] < 5.5]['sepal width (cm)'].mean(),
setosa[(setosa[discriminator] >= 5.5) & (setosa[discriminator] < 6.5)]['sepal width (cm)'].mean(),
setosa[setosa[discriminator] >= 6.5]['sepal width (cm)'].mean()
]
data1 = [
versicolor[versicolor[discriminator] < 5.5]['sepal width (cm)'].mean(),
versicolor[(versicolor[discriminator] >= 5.5) & (versicolor[discriminator] < 6.5)]['sepal width (cm)'].mean(),
versicolor[versicolor[discriminator] >= 6.5]['sepal width (cm)'].mean()
]
data2 = [
virginica[virginica[discriminator] < 5.5]['sepal width (cm)'].mean(),
virginica[(virginica[discriminator] >= 5.5) & (virginica[discriminator] < 6.5)]['sepal width (cm)'].mean(),
virginica[virginica[discriminator] >= 6.5]['sepal width (cm)'].mean()
]
# Plot with an adjusted shape to accommodate the legend
width = 0.3
bar0 = plt.bar(np.arange(len(data0)) - width, data0, width, color='C0')
bar1 = plt.bar(np.arange(len(data1)), data1, width, color='C2')
bar2 = plt.bar(np.arange(len(data2)) + width, data2, width, color='C3')
# Annotate the values above each bar
for bar in [bar0, bar1, bar2]:
for patch in bar.patches:
height = patch.get_height()
plt.text(
patch.get_x() + patch.get_width() / 2, 1.05 * height,
f'{height:4.2f}', ha='center', va='bottom'
)
# Increase the height of the plot by 10% to accommodate the labels
bottom, top = plt.ylim()
plt.ylim(bottom, top * 1.10)
# Set labels
plt.title('Average Sepal Width Given the Sepal Length')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Average Sepal Width (cm)')
# x-Axis details
plt.xticks(np.arange(3), ['0 to 5.5', '5.5 to 6.5', '6.5+'])
plt.gca().tick_params(axis='x', which='major', length=0)
plt.gca().tick_params(axis='x', which='minor', length=4)
xlocs = np.arange(len(data1) + 1) - 0.5
plt.gca().xaxis.set_minor_locator(ticker.FixedLocator(xlocs))
plt.xlim(min(xlocs), max(xlocs))
# Legend
plt.gca().legend(
(bar0[0], bar1[0], bar2[0]), ('Setosa', 'Versicolor', 'Virginica'),
fontsize=8, loc='center left', bbox_to_anchor=(1, 0.5)
)
plt.show()
Instead of using a legend, we can show that we are plotting different groups by having two y-axes that correspond to the colour of the group they refer to:
# Plot with an adjusted shape to accommodate the legend
width = 0.3
plt.bar(np.arange(len(data1)) - 0.15, data1, width, color='C0')
# Set labels
plt.title('Average Sepal Width Given the Sepal Length')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Average Sepal Width (cm) - Versicolor', color='C0')
plt.gca().tick_params(axis='y', which='major', color='C0')
plt.yticks(color='C0')
# x-Axis details
plt.xticks(np.arange(3), ['0 to 5.5', '5.5 to 6.5', '6.5+'])
plt.gca().tick_params(axis='x', which='major', length=0)
plt.gca().tick_params(axis='x', which='minor', length=4)
xlocs = np.arange(len(data1) + 1) - 0.5
plt.gca().xaxis.set_minor_locator(ticker.FixedLocator(xlocs))
plt.xlim(min(xlocs), max(xlocs))
# Second y-axis
plt.twinx()
plt.bar(np.arange(len(data2)) + 0.15, data2, width, color='C3')
plt.ylabel('Average Sepal Width (cm) - Virginica', color='C3')
plt.gca().tick_params(axis='y', which='major', color='C3')
plt.yticks(color='C3')
plt.show()