For this example we’ll use a ‘toy dataset’ (a small, freely available dataset which is realistic enough to be useful but not detailed enough to be academically insightful). Specifically, we’ll use the diabetes dataset, which contains data from 442 diabetes patients (more information here and here). Download it from the internet as tab-separated values and convert it into a Pandas data frame as follows:
import pandas as pd
# Download the tab-separated values from the internet
df = pd.read_csv('https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt', sep='\t')
print(df.head())
## AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y
## 0 59 2 32.1 101.0 157 93.2 38.0 4.0 4.8598 87 151
## 1 48 1 21.6 87.0 183 103.2 70.0 3.0 3.8918 69 75
## 2 72 2 30.5 93.0 156 93.6 41.0 4.0 4.6728 85 141
## 3 24 1 25.3 84.0 198 131.4 40.0 5.0 4.8903 89 206
## 4 50 1 23.0 101.0 192 125.4 52.0 4.0 4.2905 80 135
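If the download is unavailable, the same read_csv() call can be tried on an in-memory string. The sketch below uses the first two rows shown above purely as sample input; the names `sample` and `sample_df` are illustrative and not part of the walkthrough.

```python
import io
import pandas as pd

# The first two rows of the dataset, tab-separated (sample input only)
sample = (
    'AGE\tSEX\tBMI\tBP\tS1\tS2\tS3\tS4\tS5\tS6\tY\n'
    '59\t2\t32.1\t101.0\t157\t93.2\t38.0\t4.0\t4.8598\t87\t151\n'
    '48\t1\t21.6\t87.0\t183\t103.2\t70.0\t3.0\t3.8918\t69\t75\n'
)
# read_csv() accepts a file-like object as well as a URL or a path
sample_df = pd.read_csv(io.StringIO(sample), sep='\t')
print(sample_df.shape)
```

The only argument that matters here is sep='\t', which tells Pandas that the columns are separated by tabs rather than commas.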
For this example, we’ll only use the age, BMI and ‘S6’ columns. According to the documentation for this dataset, column S6 represents a measure of blood sugar called ‘glu’ which I think is meant to be fasting blood sugar (measured in mg/dL) although, given that these patients are diabetic, the numbers are unexpectedly low. We’ll use them anyway:
# Rename
df = df.rename(columns={'S6': 'GLU'})
# Assume the variable is fasting blood sugar, measured in mg/dL
# Trim
cols = ['AGE', 'BMI', 'GLU']
df = df[cols]
print(df.head())
## AGE BMI GLU
## 0 59 32.1 87
## 1 48 21.6 69
## 2 72 30.5 85
## 3 24 25.3 89
## 4 50 23.0 80
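We’re going to see if there is a relationship between blood sugar level and BMI of these diabetic patients. To do that, we need to classify the data into groups using the BMI values: ‘Normal’ (BMI below 25), ‘Overweight’ (BMI from 25 up to, but not including, 30) and ‘Obese’ (BMI of 30 or more).
This is done below in 3 steps: define the bin edges, define the labels, then assign each patient to a bin with pd.cut():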
# Categorise the BMI data
bins = [0, 25, 30, df['BMI'].max() + 1]
labels = ['Normal', 'Overweight', 'Obese']
df['category'] = pd.cut(df['BMI'], bins, labels=labels, right=False)
print(df.head())
## AGE BMI GLU category
## 0 59 32.1 87 Obese
## 1 48 21.6 69 Normal
## 2 72 30.5 85 Obese
## 3 24 25.3 89 Overweight
## 4 50 23.0 80 Normal
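The right=False argument decides which bin a value lands in when it sits exactly on a bin edge. The sketch below compares both settings on a few hypothetical BMI values chosen to sit on or next to the edges; it is an illustration only and uses no values from the dataset.

```python
import pandas as pd

# Hypothetical BMI values chosen to sit on or next to the bin edges
bmi = pd.Series([24.9, 25.0, 29.9, 30.0])
bins = [0, 25, 30, bmi.max() + 1]
labels = ['Normal', 'Overweight', 'Obese']

# right=False: intervals are closed on the left, e.g. [25, 30)
left_closed = pd.cut(bmi, bins, labels=labels, right=False)
# right=True (the default): intervals are closed on the right, e.g. (25, 30]
right_closed = pd.cut(bmi, bins, labels=labels, right=True)

print(pd.DataFrame({
    'BMI': bmi, 'right=False': left_closed, 'right=True': right_closed
}))
```

With right=False a BMI of exactly 25 counts as ‘Overweight’ and exactly 30 as ‘Obese’. It also explains the + 1 on the last bin edge: because the right-hand edge is excluded, the largest BMI would otherwise fall outside every bin.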
Let’s see how many people in each category have high blood sugar levels (defined as ≥90 mg/dL for this example; in real life this is well within the normal range):
# Find the number of patients with glu >= 90 mg/dL
print(df[(df['category'] == 'Normal') & (df['GLU'] >= 90)].shape[0])
print(df[(df['category'] == 'Overweight') & (df['GLU'] >= 90)].shape[0])
print(df[(df['category'] == 'Obese') & (df['GLU'] >= 90)].shape[0])
## 80
## 100
## 71
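The three near-identical lines above could also be written as a single grouped count. The following is a minimal sketch on a hypothetical stand-in data frame called `toy`, not the diabetes data; it relies on the fact that summing a Boolean column counts its True values.

```python
import pandas as pd

# Hypothetical stand-in for the data frame built above
toy = pd.DataFrame({
    'GLU': [87, 95, 90, 85, 101, 92],
    'category': pd.Categorical(
        ['Normal', 'Normal', 'Overweight', 'Overweight', 'Obese', 'Obese'],
        categories=['Normal', 'Overweight', 'Obese'], ordered=True
    ),
})

# Summing a Boolean column counts the True values in each group
high = (toy['GLU'] >= 90).groupby(toy['category'], observed=False).sum()
print(high)
```

Passing observed=False keeps every category in the result, including any that contain no patients.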
For each category, we want to know the mean glu value and the number of patients. This can be done by pivoting the table:
# Calculate the mean of each category
df = pd.pivot_table(df, values='GLU', aggfunc=['mean', 'count'], index='category')
print(df)
## mean count
## GLU GLU
## category
## Normal 87.021277 188
## Overweight 92.387097 155
## Obese 97.545455 99
Remove the ‘GLU’ level:
# Remove the 'GLU' level
df.columns = df.columns.droplevel(level=1)
print(df)
## mean count
## category
## Normal 87.021277 188
## Overweight 92.387097 155
## Obese 97.545455 99
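An alternative that avoids the extra column level altogether is to group by category and aggregate a single column. This is a sketch on a small hypothetical data frame (`toy`), offered as another route to the same shape of table rather than as a replacement for the pivot table.

```python
import pandas as pd

# Hypothetical data: two patients in each of two categories
toy = pd.DataFrame({
    'GLU': [80, 90, 100, 110],
    'category': ['Normal', 'Normal', 'Obese', 'Obese'],
})

# Aggregating a single column gives flat column names: 'mean' and 'count'
summary = toy.groupby('category')['GLU'].agg(['mean', 'count'])
print(summary)
```

Because only the ‘GLU’ column is selected before aggregating, the result has one level of column names and droplevel() is not needed.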
Now we can create the bar plot!
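A bar plot can be created directly from a Pandas data frame using the plot.bar() method:
- The use_index=True parameter uses the data frame’s index (the BMI categories) as the x-axis labels
- y='mean' selects the ‘mean’ column for the bar heights
- rot=0 keeps the x-axis labels horizontal
- The colour of the bars is set with the color keyword argument
import matplotlib.pyplot as plt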
# Create bar plot
ax = df.plot.bar(use_index=True, y='mean', rot=0, color='#95b8d1')
plt.show()
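Let’s make the following changes:
- Set the figure size with plt.rc('figure', figsize=...)
- Render the text with LaTeX using plt.rc('text', usetex=True)
- Use a serif font with plt.rc('font', family='serif')
- Load the textgreek package, which provides upright Greek letters such as \textmu, with plt.rc('text.latex', preamble=r'\usepackage{textgreek}')
- Add a title with set_title(), formatting the sample size as an integer with thousands separators (,d) and typesetting it in a smaller font (\footnotesize)
- Set the axis labels with set_xlabel() and set_ylabel()
- Set the y-axis limits with set_ylim()
- Hide the legend with legend().set_visible(False)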
# Matplotlib settings
A = 5 # Want figures to be A5
plt.rc('figure', figsize=[46.82 * .5**(.5 * A), 33.11 * .5**(.5 * A)])
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.rc('text.latex', preamble=r'\usepackage{textgreek}')
# Create bar plot
ax = df.plot.bar(use_index=True, y='mean', rot=0, color='#95b8d1')
# Format the plot
total_sample_size = df['count'].sum()
ax.set_title(
    'Blood Sugar Levels of Diabetic Patients\n' +
    rf'\footnotesize (n = {total_sample_size:,d})'
)
ax.set_xlabel('')
ax.set_ylabel(r'Mean Fasting Blood Sugar (1000 \textmu g/dL)')
ax.set_ylim([80, 100])
ax.legend().set_visible(False)
plt.show()
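The figsize expression is compact, so it may help to see the arithmetic on its own. A landscape A0 sheet is roughly 46.82 by 33.11 inches, and each step up the A-series halves the area, which scales both sides by 0.5 to the power 0.5. The helper name `a_series_size` below is hypothetical and only wraps the expression used above.

```python
# Dimensions of a landscape A0 sheet in inches (the constants used above)
A0_WIDTH, A0_HEIGHT = 46.82, 33.11


def a_series_size(n):
    """Return [width, height] in inches of a landscape A<n> sheet."""
    # Each step halves the area, so each side shrinks by a factor of 0.5 ** 0.5
    scale = 0.5 ** (0.5 * n)
    return [A0_WIDTH * scale, A0_HEIGHT * scale]


width, height = a_series_size(5)
print(f'{width:.2f} x {height:.2f} inches')
```

Changing the argument to 4 or 6 gives the figure size for A4 or A6 in the same way.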
To make annotations outside the plot area, use the text() function and specify the y-position of the annotations to be outside the range of the plot:
# Set the keyword arguments for the text labels
kwargs = {'ha': 'center', 'va': 'center', 'size': 'small'}
# Annotate the sample size below each label
for i, bar in enumerate(ax.patches):
    # For each bar, add text that contains its sample size
    sample_size = df['count'].to_list()[i]
    ax.text(
        bar.get_x() + bar.get_width() / 2, 78.5,
        f'n = {sample_size:d}', **kwargs
    )
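A hard-coded y-position such as 78.5 has to be changed whenever the y-axis limits change. One possible alternative, sketched here on a standalone plot with means rounded from the table above, is to give the y-position as a fraction of the axes height using ax.get_xaxis_transform(). The Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend so no window is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Means rounded from the table above; counts taken from the same table
bars = ax.bar(['Normal', 'Overweight', 'Obese'], [87.0, 92.4, 97.5])
counts = [188, 155, 99]
ax.set_ylim([80, 100])

for bar, count in zip(bars, counts):
    # x is in data coordinates; y is a fraction of the axes height,
    # so -0.075 sits just below the x-axis whatever the y-limits are
    ax.text(
        bar.get_x() + bar.get_width() / 2, -0.075, f'n = {count:d}',
        ha='center', va='center', size='small',
        transform=ax.get_xaxis_transform()
    )
plt.close(fig)
```

On an axis running from 80 to 100, a fraction of -0.075 corresponds to the y-value of 78.5 used above.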
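You could also split each bar into two sections to show how far it rises above 90 mg/dL. Here, bars in a second colour are drawn over the originals up to a height of 90 (or up to the mean, if that is lower) and a dashed line marks the threshold: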
# Add bar plots of different colour
x_positions = [0, 1, 2]
heights = [ax.patches[0].get_height(), 90, 90]
width = 0.5
ax.bar(x_positions, heights, width, color='#809bce')
# Add horizontal line
ax.plot([-1, 2.5], [90, 90], 'k--', lw=0.8)
To make annotations inside the plot area, use the text() function and specify the y-positions of the annotations to be inside the range of the plot:
# Add text of number who have glu >= 90 mg/dL
ax.text(1, (92.39 - 90) / 2 + 90, 'n = 100', **kwargs)
ax.text(1, 85, 'n = 55', **kwargs)
ax.text(2, (97.55 - 90) / 2 + 90, 'n = 71', **kwargs)
ax.text(2, 85, 'n = 28', **kwargs)
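The positions and labels above are typed in by hand, but they can also be derived from the numbers already calculated. The sketch below copies the means and counts from the earlier output into plain dictionaries (the variable names are illustrative) and builds the same four annotations.

```python
# Values copied from the tables and counts printed above
threshold = 90
means = {'Overweight': 92.387097, 'Obese': 97.545455}
totals = {'Overweight': 155, 'Obese': 99}
n_high = {'Overweight': 100, 'Obese': 71}

annotations = []
for x, name in enumerate(means, start=1):
    # Midpoint of the section of the bar above the threshold
    y_upper = (means[name] - threshold) / 2 + threshold
    annotations.append((x, y_upper, f'n = {n_high[name]}'))
    # Patients below the threshold are the remainder of the category
    annotations.append((x, 85, f'n = {totals[name] - n_high[name]}'))

for x, y, label in annotations:
    print(x, round(y, 2), label)
```

Each tuple can then be passed to the plot with ax.text(x, y, label, **kwargs).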
Use plt.savefig('name_of_plot.png') to save the plot to your computer. The .png, .jpg and .pdf extensions should work, as well as others. Alternatively, display the plot in a pop-up window with plt.show(). If you are plotting more than one figure in the same Python script, use plt.figure() and plt.close() before and after each, respectively, in order to tell Matplotlib when one plot ends and the next one starts.
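As a minimal sketch of that pattern, the script below saves two separate figures to a temporary folder. The file names and bar heights are made up for illustration, and the Agg backend is selected only so that the script runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # Non-interactive backend so no window is needed
import matplotlib.pyplot as plt

out_dir = tempfile.mkdtemp()
saved = []
for name, heights in [('first_plot', [1, 2, 3]), ('second_plot', [3, 2, 1])]:
    plt.figure()  # Start a new figure
    plt.bar([0, 1, 2], heights)
    path = os.path.join(out_dir, f'{name}.png')
    plt.savefig(path)
    plt.close()  # Finish this figure before the next one starts
    saved.append(path)

print(saved)
```

Without the plt.close() call, the figures would stay open in memory for the rest of the script.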