This page shows some ways of performing descriptive statistics in Python. Descriptive statistics are statistics that summarise and describe the features of a data set.
Descriptive statistics are different to inferential statistics, which are discussed at the end of this page.
The code on this page uses the Pandas, Mimesis, Matplotlib, NumPy and SciPy packages. These can be installed from the terminal with:
$ python3.11 -m pip install pandas
$ python3.11 -m pip install mimesis
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy
Replace python3.11 with the version of Python you are using. Once installed, these packages can be imported into your Python script via the following:
import pandas as pd
import mimesis
from matplotlib import pyplot as plt
import numpy as np
from scipy import stats as st
Additionally, the Standard Library module “random” is used. This does not need to be installed but does need to be imported:
import random
This page will use fake data for its examples. This can be created in Python as follows:
# Set the 'seed' to ensure that the random data is the same each time we run the code
random.seed(20221229)
# Create fake data generators
person = mimesis.Person(seed=20221229)
development = mimesis.Development(seed=20221229)
# Initialize a data frame
df = pd.DataFrame()
# Populate the data frame with 120 rows of fake data
for _ in range(120):
    previous_donor = development.boolean()
    new_row = {
        'name': person.full_name(),
        'age': person.age(),
        'blood_type': person.blood_type(),
        'previous_donor': previous_donor,
        # If the person has not donated previously, ensure that "times_donated" is zero
        'times_donated': random.randint(1, 5) * previous_donor,
        # View on non-reimbursed blood donation
        'view': person.views_on(),
    }
    new_row = pd.DataFrame(new_row, index=[1])
    df = pd.concat([df, new_row], ignore_index=True)
# Take a look
print(df.head())
## name age blood_type previous_donor times_donated view
## 0 Joana Ayers 41 AB+ False 0 Compromisable
## 1 Milford Gaines 17 B− True 2 Very negative
## 2 Son Vazquez 43 O− True 1 Compromisable
## 3 Corliss Shepherd 20 O+ False 0 Positive
## 4 Ramonita Huffman 41 B− True 2 Positive
We have created a data set that simulates a blood donation drive that had 120 donors. It includes columns that have five different data types:

- blood_type - the donor’s blood type; this is nominal data
- previous_donor - whether or not the person has donated blood before; this is binary data
- view - the donor’s view on non-reimbursed blood donation; this is ordinal data
- times_donated - the number of times each donor has donated blood before; only whole numbers are possible, so this is discrete data
- age - the donor’s age in years. In this particular data set these values are whole numbers but a person’s age doesn’t have to be a whole number; it can be a fraction in between two whole numbers and thus ‘age’ is a continuous variable.

Now, how can we calculate and present information that describes this data set? We can do so using descriptive statistics.
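As an aside, you can check how pandas itself has stored these columns. Pandas’s dtypes are related to, but not the same as, the statistical data types listed above; for the fake data generated here the output should look something like this:
# Check how pandas has stored each column
print(df.dtypes)
## name              object
## age                int64
## blood_type        object
## previous_donor      bool
## times_donated      int64
## view              object
## dtype: object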
Let’s describe the categorical variables: blood_type, previous_donor and view. For starters, the sample size of each variable is equal to the length of the data frame:
# Sample size
n = len(df)
print(f'Sample size, n = {n}')
## Sample size, n = 120
The counts of the nominal data (blood_type) are:
# Counts
counts = df['blood_type'].value_counts()
print(counts)
## blood_type
## AB− 20
## B− 18
## O− 18
## B+ 15
## O+ 14
## A− 13
## AB+ 12
## A+ 10
## Name: count, dtype: int64
…and the percentages are:
# Percentage
percentages = df['blood_type'].value_counts() / n * 100
print(f'Percent of donors with O+ blood: {percentages["O+"]:4.1f}%')
## Percent of donors with O+ blood: 11.7%
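Alternatively, the normalize parameter of value_counts() can calculate the proportions directly, avoiding the manual division by the sample size:
# Percentage, calculated directly
percentages = df['blood_type'].value_counts(normalize=True) * 100
print(f'Percent of donors with O+ blood: {percentages["O+"]:4.1f}%')
## Percent of donors with O+ blood: 11.7%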
Repeating this for the binary data (previous_donor):
# Counts
counts = df['previous_donor'].value_counts()
# Percentage
percentages = counts / n * 100
print(percentages.round(1))
## previous_donor
## False 50.8
## True 49.2
## Name: count, dtype: float64
…and the ordinal data (view):
# Counts
counts = df['view'].value_counts()
# Percentage
percentages = counts / n * 100
print(percentages.round(1))
## view
## Positive 23.3
## Negative 22.5
## Compromisable 20.0
## Neutral 17.5
## Very negative 16.7
## Name: count, dtype: float64
Categorical variables can be descriptively presented in frequency tables and bar plots.
A one-way frequency table can be used to show the counts and percentages of the nominal blood_type data more completely:
# Counts
counts = df['blood_type'].value_counts().rename('count')
# Sort the index (the blood types) alphabetically
counts = counts.sort_index()
# Percentages
percentages = (counts / n * 100).rename('%')
# Merge the counts and percentages into one table
frequency_table = pd.merge(counts, percentages, left_index=True, right_index=True)
# Calculate the totals
totals = frequency_table.sum(axis=0).rename('total')
# Add the totals as a new column
frequency_table = pd.concat([frequency_table.T, totals], axis=1).round(1)
print(frequency_table)
## A+ AB+ AB− A− B+ B− O+ O− total
## count 10.0 12.0 20.0 13.0 15.0 18.0 14.0 18.0 120.0
## % 8.3 10.0 16.7 10.8 12.5 15.0 11.7 15.0 100.0
The best way to show this data on a graph would be to make a single-group bar plot. More information on creating these can be found on this page and this page.
# Counts
counts = df['blood_type'].value_counts(sort=False)
# Create bar plot
ax = counts.plot.bar(rot=0, color='#95b8d1')
# Labels
ax.set_title('Displaying Nominal Data in a Bar Plot')
ax.set_ylabel('Count')
ax.set_xlabel('Blood Type')
plt.show()
Use a two-way frequency table to show two variables at once. In this example they are the binary and the ordinal variables (previous_donor and view):
# Create a cross-tabulation of two variables
ct = pd.crosstab(df['previous_donor'], df['view'])
# Re-order the columns
ct = ct[['Very negative', 'Negative', 'Neutral', 'Compromisable', 'Positive']]
# Calculate the totals for each view and add them to the cross-tabulation
totals = ct.sum(axis=0).rename('total')
ct = pd.concat([ct.T, totals], axis=1).round(1)
# Calculate the totals for each donor status and add those too
totals = ct.sum(axis=0).rename('total')
ct = pd.concat([ct.T, totals], axis=1).round(1)
print(ct)
## Very negative Negative Neutral Compromisable Positive total
## False 6 15 15 11 14 61
## True 14 12 6 13 14 59
## total 20 27 21 24 28 120
# Calculate the percentages
ct = (ct / n * 100).round(1)
print(ct)
## Very negative Negative Neutral Compromisable Positive total
## False 5.0 12.5 12.5 9.2 11.7 50.8
## True 11.7 10.0 5.0 10.8 11.7 49.2
## total 16.7 22.5 17.5 20.0 23.3 100.0
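Incidentally, pd.crosstab() can produce a table like this directly: its normalize parameter converts the counts into proportions and its margins parameter adds the totals (margins_name sets their label). A sketch, with the output omitted:
# Cross-tabulation of percentages, with totals
ct_pct = pd.crosstab(
    df['previous_donor'], df['view'],
    normalize='all', margins=True, margins_name='total'
) * 100
print(ct_pct.round(1))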
The best way to show two categorical variables on a graph at once is to make a multi-group bar plot. More information on creating these can be found on this page and this page.
# Create a cross-tabulation of two variables
ct = pd.crosstab(df['previous_donor'], df['view'])
# Re-order the columns
ct = ct[['Very negative', 'Negative', 'Neutral', 'Compromisable', 'Positive']]
# Plot with an adjusted shape to accommodate the legend
ax = plt.axes([0.1, 0.1, 0.7, 0.84])
# Create bar plot
ct.T.plot.bar(rot=0, color=['#95b8d1', '#800000'], ax=ax)
# Labels
ax.set_title('Displaying Binary and Ordinal Data in a Bar Plot')
ax.set_ylabel('Count')
ax.set_xlabel('Views on Non-Reimbursed Blood Donation')
# Legend
plt.gca().legend(
title='Previous Donor', fontsize=8,
loc='center left', bbox_to_anchor=(1, 0.5)
)
plt.show()
As it stands, Python doesn’t realise that the ordinal data (the view column) has an inherent order. Logically, a Neutral viewpoint should exist between a Negative and a Positive viewpoint - they are ordered. Fortunately, Pandas has a special ‘Categorical’ data type that enables both a name and an order to be stored, and the view column can be converted to Categorical via the pd.Categorical() function as shown below:
order = ['Very negative', 'Negative', 'Neutral', 'Compromisable', 'Positive']
df['view'] = pd.Categorical(df['view'], order, ordered=True)
print(df['view'].unique())
## ['Compromisable', 'Very negative', 'Positive', 'Neutral', 'Negative']
## Categories (5, object): ['Very negative' < 'Negative' < 'Neutral' < 'Compromisable' < 'Positive']
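One payoff of this, as a quick illustration: rank-based operations that would fail or be meaningless on plain strings now work on the view column:
# The 'lowest' and 'highest' views according to the defined order
print(f"{df['view'].min()} < {df['view'].max()}")
## Very negative < Positive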
Now let’s describe the two quantitative variables: times_donated (which are discrete values) and age (which are continuous values).
The counts of the discrete data are:
# Counts
counts = df['times_donated'].value_counts()
# Sort by the number of times donated
counts = counts.sort_index()
print(counts)
## times_donated
## 0 61
## 1 12
## 2 9
## 3 11
## 4 12
## 5 15
## Name: count, dtype: int64
The percentages of the discrete data are:
# Percentage
percentages = (counts / n * 100).round(1)
print(percentages)
## times_donated
## 0 50.8
## 1 10.0
## 2 7.5
## 3 9.2
## 4 10.0
## 5 12.5
## Name: count, dtype: float64
The counts and percentages of the continuous data can be shown in a frequency distribution whereby the values are grouped into bins. The histogram() function from the NumPy package is useful for this (download and install the NumPy package from the terminal with python3.11 -m pip install numpy):
# Create data
counts, bins = np.histogram(df['age'])
print(counts)
## [14 9 11 16 12 17 14 10 11 6]
print(bins)
## [16. 21. 26. 31. 36. 41. 46. 51. 56. 61. 66.]
These can then be tabulated:
intervals = [f'{int(v)} to {int(bins[i + 1])}' for i, v in enumerate(bins[:-1])]
dct = {
'Interval (bin)': intervals,
'Frequency': counts,
'%': (counts / n * 100).round(1),
'Cumulative %': (counts / n * 100).cumsum().round(1)
}
frequency_distribution = pd.DataFrame(dct)
print(frequency_distribution)
## Interval (bin) Frequency % Cumulative %
## 0 16 to 21 14 11.7 11.7
## 1 21 to 26 9 7.5 19.2
## 2 26 to 31 11 9.2 28.3
## 3 31 to 36 16 13.3 41.7
## 4 36 to 41 12 10.0 51.7
## 5 41 to 46 17 14.2 65.8
## 6 46 to 51 14 11.7 77.5
## 7 51 to 56 10 8.3 85.8
## 8 56 to 61 11 9.2 95.0
## 9 61 to 66 6 5.0 100.0
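For what it’s worth, the same binning can be done in pandas alone with the pd.cut() function, as sketched below. Beware that the bin-edge conventions differ: pd.cut() closes bins on the right by default whereas np.histogram() closes them on the left, so ages lying exactly on a bin edge can be counted differently:
# Group the ages into the same bins using pandas
binned = pd.cut(df['age'], bins=bins, include_lowest=True)
print(binned.value_counts(sort=False))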
The data can be plotted in this format in a histogram (see more on this page and this page):
ax = df['age'].plot.hist(bins=11, color='#95b8d1')
ax.set_title('Histogram of Continuous Data')
ax.set_xlabel('Age [yrs]')
ax.set_xlim(16, 66)
plt.show()
An alternative to the above would be to instead describe the data with quartiles and box-and-whisker plots.
A single group of continuous data can be described using a five-number summary:
# Five number summary
minimum = df['age'].min()
first_quartile = df['age'].quantile(0.25)
median = df['age'].median() # aka the second quartile
third_quartile = df['age'].quantile(0.75)
maximum = df['age'].max()
print(minimum, first_quartile, median, third_quartile, maximum)
## 16 28.0 39.0 49.25 66
The five number summary divides the ordered data points into four equal parts, hence why these parts are known as quartiles. In general, when dividing ordered data points up into equal parts these parts are known as ‘quantiles’; specific examples of quantiles include quartiles (four parts), deciles (ten parts) and percentiles (one hundred parts):
# Percentiles
tenth_percentile = df['age'].quantile(0.1) # 10th percentile
ninetieth_percentile = df['age'].quantile(0.9) # 90th percentile
print(f'10th percentile is {tenth_percentile:4.1f} years; 90th percentile is {ninetieth_percentile:4.1f} years')
## 10th percentile is 20.0 years; 90th percentile is 57.0 years
Another potentially useful descriptive statistic is the range: the largest value minus the smallest:
print(f'The range is {maximum - minimum}')
## The range is 50
The five number summary can be represented in a box-and-whisker plot:
ax = plt.axes()
bp = df.boxplot(column='age', grid=False, return_type='dict', vert=False, ax=ax)
plt.setp(bp['boxes'], color='k')
plt.setp(bp['medians'], color='k')
plt.setp(bp['whiskers'], color='k')
ax.yaxis.set_ticklabels([])
ax.yaxis.set_ticks([])
ax.set_title('Box-and-Whisker Plot of Continuous Data')
ax.set_xlabel('Age [yrs]')
ax.scatter(df['age'], np.ones(n) + 0.03 * np.random.randn(n), s=4)
plt.show()
Viewing data in a histogram or a box-and-whisker plot can help in finding extreme observations (outliers) and asymmetric distributions (skewness). In particular, data points that are further than 1.5 times the length of the box away from the relevant quartile can be considered outliers.
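As a sketch of how this rule could be applied in code (the ‘fence’ variable names are just illustrative, and the length of the box is the interquartile range):
# Calculate the outlier 'fences' using the 1.5 * IQR rule
q1 = df['age'].quantile(0.25)
q3 = df['age'].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# Find the data points that lie outside of the fences
outliers = df[(df['age'] < lower_fence) | (df['age'] > upper_fence)]
print(f'Ages outside of {lower_fence} to {upper_fence} are outliers; there are {len(outliers)}')
## Ages outside of -3.875 to 81.125 are outliers; there are 0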
Splitting the data into groups can help in seeing differences or similarities between them.
ax = plt.axes()
bp = df.boxplot(column='age', by='previous_donor', grid=False, return_type='dict', ax=ax)
# Iterate over each box
for box in bp:
plt.setp(box['boxes'], color='k')
plt.setp(box['medians'], color='k')
plt.setp(box['whiskers'], color='k')
plt.suptitle('')
ax.set_title('Box-and-Whisker Plot of Continuous Data')
ax.set_ylabel('Age [yrs]')
ax.set_xlabel('Previous Donor')
plt.show()
The mean is based on the values of the data points and is usually more representative when the data is not skewed. Use “\(\mu\)” (the lowercase Greek letter ‘mu’) for the population mean (eg if you are doing inferential statistics) and “\(\bar{x}\)” (pronounced ‘x bar’) for the sample mean (eg if you are doing descriptive statistics):
# Mean of a Pandas series
x_bar = df['age'].mean()
print(f'The mean age is {x_bar:4.1f} years')
## The mean age is 39.2 years
This can also be done with NumPy’s mean()
function:
# Mean of an array-like object
x_bar = np.mean(df['age'])
print(f'The mean age is {x_bar:4.1f} years')
## The mean age is 39.2 years
The median is based on the ranks of the data points and is usually more representative when the data is skewed:
# Median of a Pandas series
median = df['times_donated'].median()
print(f'The median number of previous donations is {int(median)}')
## The median number of previous donations is 0
This can also be done with NumPy’s median()
function:
# Median of an array-like object
median = np.median(df['times_donated'])
print(f'The median number of previous donations is {int(median)}')
## The median number of previous donations is 0
The mode is based on the frequency at which data points appear and is usually more representative when the data is nominal:
# Mode of a Pandas series
mode = df['blood_type'].mode()
# If there are multiple modes they will all be returned. If this is the case,
# let's just take the first
mode = mode[0]
print(f'The most common blood type in this sample is {mode}')
## The most common blood type in this sample is AB−
For the record, in real life, AB− is the LEAST COMMON blood type!
The mode can also be found with SciPy’s mode()
function, although this only works with numeric data:
# Mode of an array-like object of numerics
mode, count = st.mode(df['age'])
print(f'The most common age in this sample is {mode}')
## The most common age in this sample is 43
The dispersion within a set of data points can be described and visualized using the standard deviation, the interquartile range, dot plots and box-and-whisker plots.
The standard deviation is the square root of the variance. The variance is the “mean squared distance of the data points from the mean” - in other words it’s a measure of how dispersed the data is around the mean in units that are the same as that of the data itself but squared. By implication, the units of the standard deviation are exactly the same as that of the data itself. So the variance is meaningful as a measure of variability but the standard deviation is useful because it has the same units as the data. An example of this usefulness is the fact that we can make statements like “about 95% of the data is found within two standard deviations either side of the sample mean, assuming a normal distribution” (see the section on Confidence Levels below), which is only possible because the units of the standard deviation, the mean and the data itself are all the same.
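As a quick sanity check of this relationship using the age data (this should print True):
# Confirm that the standard deviation is the square root of the variance
variance = df['age'].var()
standard_deviation = df['age'].std()
print(np.isclose(np.sqrt(variance), standard_deviation))
## True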
Use “\(\sigma\)” (the lowercase Greek letter ‘sigma’) for population standard deviation (eg if you are doing inferential statistics) and “\(s\)” for sample standard deviation (eg if you are doing descriptive statistics). Population standard deviation is calculated using a delta degrees of freedom (“ddof”) value of 0 while sample standard deviation has a ddof value of 1:
# Population standard deviation
σ = df['age'].std(ddof=0)
print(σ)
## 13.280308626768514
# Sample standard deviation
s = df['age'].std(ddof=1)
print(f'The mean age is {x_bar:4.1f} years with a sample standard deviation of {s:4.2f} years')
## The mean age is 39.2 years with a sample standard deviation of 13.34 years
The standard deviation is based on the data points’ values and gets reported with the mean (as has been done above). This is in contrast to the interquartile range which is based on the data points’ ranks and gets reported with the median.
As mentioned above: “Population standard deviation is calculated using a delta degrees of freedom (‘ddof’) value of 0 while sample standard deviation has a ddof value of 1”.
This is the programmers’ way of saying that the formula for the population standard deviation involves dividing by \(n\) (the sample size) whereas the formula for the sample standard deviation involves dividing by \(n - 1\). The ddof value is the amount that gets subtracted from the sample size in the formula that you want Python to use.
Broadly speaking, the degrees of freedom of a set of numbers is the sample size minus the number of pieces of information known about the numbers. For example, if someone asked you to choose 8 numbers then you would have the freedom to choose 8 values of your pleasing. However, if someone asked you to choose 8 numbers that summed to \(x\) then you would only have the freedom to choose 7 values because the 8th value would have to be such that the overall total was \(x\). Similarly, if someone asked you to choose 8 numbers that summed to \(x\) and had a product of \(y\) then you would only be able to choose 6 values freely; the 7th and 8th values would need to be such that the total sum and product were correct.
When calculating the sample standard deviation the mean of the values is necessarily calculated first, hence one piece of information about the numbers is created and the number of degrees of freedom decreases by one as a result. Thus, the change in the number of degrees of freedom is 1 and, because ‘delta’ is used in maths to mean ‘change’, the delta degrees of freedom (‘ddof’) is 1. When calculating the population standard deviation, however, considerations about the number of degrees of freedom are irrelevant (the ddof is 0): you have all the information that exists about this metric in this population and so you are calculating exact values.
This idea of subtracting 1 from the sample size when calculating sample variance (and hence when calculating sample standard deviation) is known as “Bessel’s correction”. To quote Wikipedia: “it corrects the bias in the estimation of the population variance, and some, but not all of the bias in the estimation of the population standard deviation”. See the pages on Bessel’s correction and on the Unbiased estimation of standard deviation for more information.
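To make this concrete, here is a sketch that re-derives both versions from the sum of the squared deviations from the mean - dividing by \(n\) and by \(n - 1\) respectively - and checks them against pandas:
# Sum of the squared deviations from the mean
squared_deviations = (df['age'] - df['age'].mean())**2
# Population variance: divide by n (ddof=0)
population_variance = squared_deviations.sum() / n
# Sample variance: divide by n - 1 (Bessel's correction, ddof=1)
sample_variance = squared_deviations.sum() / (n - 1)
print(np.isclose(population_variance, df['age'].var(ddof=0)))
print(np.isclose(sample_variance, df['age'].var(ddof=1)))
## True
## True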
A useful rule-of-thumb is that about 95% of normally-distributed data is found within two standard deviations either side of the sample mean (see the Wikipedia page on this “68–95–99.7 rule”). For the sake of interest, here’s how to calculate this:
# Number of standard deviations
z_critical = 2
# Cumulative distribution function of the normal distribution
phi = st.norm.cdf(z_critical)
# Number of tails
tails = 2
# Significance level
alpha = (1 - phi) * tails
# Confidence level
c = (1 - alpha) * 100
print(f'{c}% of normally-distributed data is found within two standard deviations of the mean')
## 95.44997361036415% of normally-distributed data is found within two standard deviations of the mean
The reverse of the above statement is possibly also interesting: “95% of normally-distributed data is found within about 1.96 standard deviations either side of the sample mean”. Again, here’s how to calculate this:
# Confidence level
c = 95 # %
# Significance level
alpha = 1 - (c / 100)
# Number of tails
tails = 2
# Number of standard deviations
z_critical = st.norm.ppf(1 - alpha / tails)
print(f'95% of normally-distributed data is found within {z_critical} standard deviations of the mean')
## 95% of normally-distributed data is found within 1.959963984540054 standard deviations of the mean
Dot plots are a good way to visualize dispersion (or the lack thereof) when the data is not skewed:
# Create axes
ax = plt.axes()
# Create scatter plot
ax.scatter(df['age'], np.ones(n) + 0.1 * np.random.randn(n), s=4)
# Create straight lines
ax.axhline(1, c='k')
ax.axvline(x_bar - s, 0.4, 0.6, c='k', ls='--')
ax.axvline(x_bar, 0.25, 0.75, c='k', ls='--')
ax.axvline(x_bar + s, 0.4, 0.6, c='k', ls='--')
# Axes' customisation
ax.set_title('Dot Plot of Continuous Data')
ax.yaxis.set_ticklabels([])
ax.yaxis.set_ticks([])
ax.set_ylim([0, 2])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.set_xlabel('Age [yrs]')
plt.show()
As mentioned above, quartiles are useful for describing numerical data and deciding if it is skewed. If it is, it might be decided that the interquartile range is a better measure of dispersion and variability than the standard deviation. Remember that ‘quartiles’ are a type of ‘quantile’, hence why the .quantile() method is being used below:
first_quartile = df['age'].quantile(0.25)
second_quartile = df['age'].quantile(0.5)
third_quartile = df['age'].quantile(0.75)
iqr = third_quartile - first_quartile
print(f'The median age is {second_quartile} years with an interquartile range of {iqr} years')
## The median age is 39.0 years with an interquartile range of 21.25 years
In one line of code:
# One-liner
iqr = df['age'].quantile([0.25, 0.75]).diff().iloc[-1]
print(f'The median age is {second_quartile} years with an interquartile range of {iqr} years')
## The median age is 39.0 years with an interquartile range of 21.25 years
…or, using SciPy:
# One-liner
iqr = st.iqr(df['age'])
print(f'The median age is {second_quartile} years with an interquartile range of {iqr} years')
## The median age is 39.0 years with an interquartile range of 21.25 years
The interquartile range is based on the data points’ ranks and gets reported with the median (as has been done above). This is in contrast to the standard deviation which is based on the data points’ values and gets reported with the mean.
As mentioned above, box-and-whisker plots are a good way to visualize dispersion (or the lack thereof) when the data is skewed:
# Create axes
ax = plt.axes()
# Create box plot
bp = df.boxplot(column='age', grid=False, return_type='dict', vert=False)
# Edit box plot
plt.setp(bp['boxes'], color='k')
plt.setp(bp['medians'], color='k')
plt.setp(bp['whiskers'], color='k')
# Edit axes
ax.set_title('Box Plot')
ax.yaxis.set_ticklabels([])
ax.yaxis.set_ticks([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.set_xlabel('Age [yrs]')
# Create scatter plot
ax.scatter(df['age'], np.ones(n) + 0.03 * np.random.randn(n), s=4)
plt.show()
The quickest way to get descriptive statistics for numerical data is to use the .describe() method:
# Describe a series (a column) of numerical data
print(df['age'].describe())
## count 120.000000
## mean 39.241667
## std 13.335992
## min 16.000000
## 25% 28.000000
## 50% 39.000000
## 75% 49.250000
## max 66.000000
## Name: age, dtype: float64
# Describe the numerical data in a data frame
print(df.describe())
## age times_donated
## count 120.000000 120.000000
## mean 39.241667 1.550000
## std 13.335992 1.891378
## min 16.000000 0.000000
## 25% 28.000000 0.000000
## 50% 39.000000 0.000000
## 75% 49.250000 3.000000
## max 66.000000 5.000000
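By default only the numerical columns are described. Setting the include parameter to 'all' adds summary statistics for the other columns too (their count, number of unique values and most frequent value); a sketch, with the output omitted:
# Describe all the columns in a data frame, not just the numerical ones
print(df.describe(include='all'))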
If you wanted to know something about an entire population - eg the average height of all the adults in a city or the proportion of fish in a lake that are of a particular species - it would be impractical to measure this directly. While it might be theoretically possible to measure everyone’s height or to catch all of the fish, a realistic alternative would be to only look at a few individuals and then make generalizations about the whole from that limited subset. This is done using inferential statistics: statistics that use measurements taken from a sample to draw conclusions about the population as a whole.
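For a small taste of what this looks like, here is a sketch that uses SciPy’s t distribution to construct a confidence interval for the mean age of the population that our 120 donors represent (the 95% level is an assumption chosen for illustration):
# Sample mean and standard error of the mean
x_bar = df['age'].mean()
sem = st.sem(df['age'])
# 95% confidence interval for the population mean age
ci = st.t.interval(0.95, len(df['age']) - 1, loc=x_bar, scale=sem)
print(f'Mean age: {x_bar:4.1f} years; 95% CI [{ci[0]:4.1f}, {ci[1]:4.1f}]')
## Mean age: 39.2 years; 95% CI [36.8, 41.7]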
Many of the topics within inferential statistics have their own pages; go back to find them.