⇦ Back

This page shows some ways of performing descriptive statistics in Python. Descriptive statistics are statistics that:

  • Quantitatively describe or summarise the data that has actually been collected
  • Describe the sample itself, rather than being used to make generalisations about a wider population

Descriptive statistics are different to inferential statistics, which are discussed at the end of this page.

1 Packages

The code on this page uses the Pandas, Mimesis, Matplotlib, NumPy and SciPy packages. These can be installed from the terminal with:

$ python3.11 -m pip install pandas
$ python3.11 -m pip install mimesis
$ python3.11 -m pip install matplotlib
$ python3.11 -m pip install numpy
$ python3.11 -m pip install scipy

Replace python3.11 with the version of Python you are using. Once installed, these packages can be imported into your Python script via the following:

import pandas as pd
import mimesis
from matplotlib import pyplot as plt
import numpy as np
from scipy import stats as st

Additionally, the Standard Library module “random” is used. This does not need to be installed but does need to be imported:

import random

2 Example Data

This page will use fake data for its examples. This can be created in Python as follows:

# Set the 'seed' to ensure that the random data is the same each time we run the code
random.seed(20221229)
# Create fake data generators
person = mimesis.Person(seed=20221229)
development = mimesis.Development(seed=20221229)

# Initialize a data frame
df = pd.DataFrame()
# Populate the data frame with 120 rows of fake data
for _ in range(120):
    previous_donor = development.boolean()
    new_row = {
        'name': person.full_name(),
        'age': person.age(),
        'blood_type': person.blood_type(),
        'previous_donor': previous_donor,
        # If the person has not donated previously, ensure that "times_donated" is zero
        'times_donated': random.randint(1, 5) * previous_donor,
        # View on non-reimbursed blood donation
        'view': person.views_on(),
    }
    new_row = pd.DataFrame(new_row, index=[1])
    df = pd.concat([df, new_row], ignore_index=True)

# Take a look
print(df.head())
##                name  age blood_type  previous_donor  times_donated           view
## 0       Joana Ayers   41        AB+           False              0  Compromisable
## 1    Milford Gaines   17         B−            True              2  Very negative
## 2       Son Vazquez   43         O−            True              1  Compromisable
## 3  Corliss Shepherd   20         O+           False              0       Positive
## 4  Ramonita Huffman   41         B−            True              2       Positive

We have created a data set that simulates a blood donation drive that had 120 donors. It includes columns that have five different data types:

  • Categorical data types:
    • Nominal (values without quantity): the donors’ blood_type
    • Binary (there are only two possible values): whether or not each person is a previous_donor
    • Ordinal (qualitative data where the possible values have an order or a rank): each donor’s view on non-reimbursed blood donation
  • Quantitative data types:
    • Discrete (there are gaps between the possible values): times_donated - the number of times each donor has donated blood before; only whole numbers are possible
    • Continuous (there are no gaps between the possible values): each person’s age. In this particular data set these values happen to be whole numbers, but a person’s age doesn’t have to be a whole number; it can take any value in between two whole numbers and thus ‘age’ is a continuous variable.
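
As a quick check of the above (a minimal sketch; nothing here is needed for the rest of the page), we can look at how Pandas is storing these columns. Note that the categorical columns are stored as generic objects or Booleans at this point; a dedicated ‘Categorical’ data type is introduced later on this page:

# Inspect how Pandas stores each column (expect 'object' for the text columns,
# 'bool' for previous_donor and 'int64' for age and times_donated)
print(df.dtypes)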

Now, how can we calculate and present information that describes this data set? We can do so using descriptive statistics.

3 Categorical Variables (aka Factors)

Let’s describe the categorical variables: blood_type, previous_donor and view. For starters, the sample size of each variable is equal to the length of the data frame:

# Sample size
n = len(df)

print(f'Sample size, n = {n}')
## Sample size, n = 120

3.1 Counts and Percentages

The counts of the nominal data (blood_type) are:

# Counts
counts = df['blood_type'].value_counts()

print(counts)
## blood_type
## AB−    20
## B−     18
## O−     18
## B+     15
## O+     14
## A−     13
## AB+    12
## A+     10
## Name: count, dtype: int64

…and the percentages are:

# Percentage
percentages = df['blood_type'].value_counts() / n * 100

print(f'Percent of donors with O+ blood: {percentages["O+"]:4.1f}%')
## Percent of donors with O+ blood: 11.7%

Repeating this for the binary data (previous_donor):

# Counts
counts = df['previous_donor'].value_counts()
# Percentage
percentages = counts / n * 100

print(percentages.round(1))
## previous_donor
## False    50.8
## True     49.2
## Name: count, dtype: float64

…and the ordinal data (view):

# Counts
counts = df['view'].value_counts()
# Percentage
percentages = counts / n * 100

print(percentages.round(1))
## view
## Positive         23.3
## Negative         22.5
## Compromisable    20.0
## Neutral          17.5
## Very negative    16.7
## Name: count, dtype: float64

3.2 Tables and Graphs

Categorical variables can be descriptively presented in frequency tables and bar plots.

3.2.1 One-Way Frequency Table

A one-way frequency table can be used to show the counts and percentages of the nominal blood_type data more completely:

# Counts
counts = df['blood_type'].value_counts().rename('count')
# Sort the rows alphabetically by blood type
counts = counts.sort_index()
# Percentages
percentages = (counts / n * 100).rename('%')
# Merge the counts and percentages into one table
frequency_table = pd.merge(counts, percentages, left_index=True, right_index=True)
# Calculate the totals
totals = frequency_table.sum(axis=0).rename('total')
# Add the totals as a new column
frequency_table = pd.concat([frequency_table.T, totals], axis=1).round(1)

print(frequency_table)
##          A+   AB+   AB−    A−    B+    B−    O+    O−  total
## count  10.0  12.0  20.0  13.0  15.0  18.0  14.0  18.0  120.0
## %       8.3  10.0  16.7  10.8  12.5  15.0  11.7  15.0  100.0

3.2.2 Single-Group Bar Plot

The best way to show this data on a graph would be to make a single-group bar plot. More information on creating these can be found on this page and this page.

# Counts
counts = df['blood_type'].value_counts(sort=False)
# Create bar plot
ax = counts.plot.bar(rot=0, color='#95b8d1')
# Labels
ax.set_title('Displaying Nominal Data in a Bar Plot')
ax.set_ylabel('Count')
ax.set_xlabel('Blood Type')

plt.show()

3.2.3 Two-Way Frequency Table (aka a Contingency Table or Cross-Tabulation)

Use a two-way frequency table to show two variables at once. In this example they are the binary and the ordinal variables (previous_donor and view):

# Create a cross-tabulation of two variables
ct = pd.crosstab(df['previous_donor'], df['view'])
# Re-order the columns
ct = ct[['Very negative', 'Negative', 'Neutral', 'Compromisable', 'Positive']]
# Calculate the totals for each view and add them to the cross-tabulation
totals = ct.sum(axis=0).rename('total')
ct = pd.concat([ct.T, totals], axis=1)
# Repeat to add the totals for each donor status (plus the grand total), which
# also transposes the table back to its original orientation
totals = ct.sum(axis=0).rename('total')
ct = pd.concat([ct.T, totals], axis=1)

print(ct)
##        Very negative  Negative  Neutral  Compromisable  Positive  total
## False              6        15       15             11        14     61
## True              14        12        6             13        14     59
## total             20        27       21             24        28    120
# Calculate the percentages
ct = (ct / n * 100).round(1)

print(ct)
##        Very negative  Negative  Neutral  Compromisable  Positive  total
## False            5.0      12.5     12.5            9.2      11.7   50.8
## True            11.7      10.0      5.0           10.8      11.7   49.2
## total           16.7      22.5     17.5           20.0      23.3  100.0

3.2.4 Multi-Group Bar Plot

The best way to show two categorical variables on a graph at once is to make a multi-group bar plot. More information on creating these can be found on this page and this page.

# Create a cross-tabulation of two variables
ct = pd.crosstab(df['previous_donor'], df['view'])
# Re-order the columns
ct = ct[['Very negative', 'Negative', 'Neutral', 'Compromisable', 'Positive']]

# Plot with an adjusted shape to accommodate the legend
ax = plt.axes([0.1, 0.1, 0.7, 0.84])
# Create bar plot
ct.T.plot.bar(rot=0, color=['#95b8d1', '#800000'], ax=ax)
# Labels
ax.set_title('Displaying Binary and Ordinal Data in a Bar Plot')
ax.set_ylabel('Count')
ax.set_xlabel('Views on Non-Reimbursed Blood Donation')
# Legend
plt.gca().legend(
    title='Previous Donor', fontsize=8,
    loc='center left', bbox_to_anchor=(1, 0.5)
)

plt.show()

3.3 The Categorical Data Type

As it stands, Python doesn’t realise that the ordinal data (the view column) has an inherent order. Logically, a Neutral viewpoint should exist between a Negative and a Positive viewpoint - they are ordered. Fortunately, Pandas has a special ‘Categorical’ data type that enables both a name and an order to be stored, and the view column can be converted to Categorical via the pd.Categorical() function as shown below:

order = ['Very negative', 'Negative', 'Neutral', 'Compromisable', 'Positive']
df['view'] = pd.Categorical(df['view'], order, ordered=True)

print(df['view'].unique())
## ['Compromisable', 'Very negative', 'Positive', 'Neutral', 'Negative']
## Categories (5, object): ['Very negative' < 'Negative' < 'Neutral' < 'Compromisable' < 'Positive']
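
One benefit of this (a minimal demonstration) is that order-aware operations now behave sensibly, for example sorting by rank rather than alphabetically, or comparing values against a reference category:

# The categories now sort by rank, not alphabetically
print(df['view'].sort_values().unique())
# Order-aware comparisons are now possible, eg counting the donors whose view
# is more positive than 'Neutral'
print((df['view'] > 'Neutral').sum())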

4 Quantitative (Numerical) Variables

Now let’s describe the two quantitative variables: times_donated (which are discrete values) and age (which are continuous values).

4.1 Counts and Percentages

The counts of the discrete data are:

# Counts
counts = df['times_donated'].value_counts()
# Sort by the number of previous donations
counts = counts.sort_index()

print(counts)
## times_donated
## 0    61
## 1    12
## 2     9
## 3    11
## 4    12
## 5    15
## Name: count, dtype: int64

The percentages of the discrete data are:

# Percentage
percentages = (counts / n * 100).round(1)

print(percentages)
## times_donated
## 0    50.8
## 1    10.0
## 2     7.5
## 3     9.2
## 4    10.0
## 5    12.5
## Name: count, dtype: float64

4.2 Frequency Distribution and Histogram

The counts and percentages of the continuous data can be shown in a frequency distribution whereby the values are grouped into bins. The histogram() function from the NumPy package (installed and imported in the Packages section above) is useful for this:

# Create data
counts, bins = np.histogram(df['age'])

print(counts)
## [14  9 11 16 12 17 14 10 11  6]
print(bins)
## [16. 21. 26. 31. 36. 41. 46. 51. 56. 61. 66.]

These can then be tabulated:

intervals = [f'{int(v)} to {int(bins[i + 1])}' for i, v in enumerate(bins[:-1])]
dct = {
    'Interval (bin)': intervals,
    'Frequency': counts,
    '%': (counts / n * 100).round(1),
    'Cumulative %': (counts / n * 100).cumsum().round(1)
}
frequency_distribution = pd.DataFrame(dct)

print(frequency_distribution)
##   Interval (bin)  Frequency     %  Cumulative %
## 0       16 to 21         14  11.7          11.7
## 1       21 to 26          9   7.5          19.2
## 2       26 to 31         11   9.2          28.3
## 3       31 to 36         16  13.3          41.7
## 4       36 to 41         12  10.0          51.7
## 5       41 to 46         17  14.2          65.8
## 6       46 to 51         14  11.7          77.5
## 7       51 to 56         10   8.3          85.8
## 8       56 to 61         11   9.2          95.0
## 9       61 to 66          6   5.0         100.0

The data can be plotted in this format in a histogram (see more on this page and this page):

ax = df['age'].plot.hist(bins=11, color='#95b8d1')
ax.set_title('Histogram of Continuous Data')
ax.set_xlabel('Age [yrs]')
ax.set_xlim(16, 66)

plt.show()

4.3 Quartiles and Box-and-Whisker Plot

An alternative to the above would be to instead describe the data with quartiles and box-and-whisker plots.

4.3.1 Single Group

A single group of continuous data can be described using a five number summary:

# Five number summary
minimum = df['age'].min()
first_quartile = df['age'].quantile(0.25)
median = df['age'].median()  # aka the second quartile
third_quartile = df['age'].quantile(0.75)
maximum = df['age'].max()

print(minimum, first_quartile, median, third_quartile, maximum)
## 16 28.0 39.0 49.25 66

The five number summary divides the ordered data points into four equal parts, hence why these parts are known as quartiles. In general, when ordered data points are divided up into equal parts those parts are known as ‘quantiles’; specific examples of quantiles include quartiles (four parts), deciles (ten parts) and percentiles (one hundred parts):

# Percentiles
tenth_percentile = df['age'].quantile(0.1)  # 10th percentile
ninetieth_percentile = df['age'].quantile(0.9)  # 90th percentile

print(f'10th percentile is {tenth_percentile:4.1f} years; 90th percentile is {ninetieth_percentile:4.1f} years')
## 10th percentile is 20.0 years; 90th percentile is 57.0 years

Another potentially useful descriptive statistic is the range: the largest value minus the smallest:

print(f'The range is {maximum - minimum}')
## The range is 50

4.3.2 Box-and-Whisker Plot (Box Plot)

The five number summary can be represented in a box-and-whisker plot:

ax = plt.axes()
bp = df.boxplot(column='age', grid=False, return_type='dict', vert=False, ax=ax)
plt.setp(bp['boxes'], color='k')
plt.setp(bp['medians'], color='k')
plt.setp(bp['whiskers'], color='k')
ax.yaxis.set_ticklabels([])
ax.yaxis.set_ticks([])
ax.set_title('Box-and-Whisker Plot of Continuous Data')
ax.set_xlabel('Age [yrs]')
ax.scatter(df['age'], np.ones(n) + 0.03 * np.random.randn(n), s=4)

plt.show()

Viewing data in a histogram or a box-and-whisker plot can help in finding extreme observations (outliers) and asymmetric distributions (skewness). In particular, data points that are further than 1.5 times the length of the box away from the relevant quartile can be considered outliers.
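
This outlier rule can be coded up directly. Here’s a minimal sketch using the quartiles of the age data (with this particular seeded data set no ages should fall outside the fences, so the result should be an empty data frame):

# Quartiles
q1 = df['age'].quantile(0.25)
q3 = df['age'].quantile(0.75)
# Interquartile range (the length of the box)
iqr = q3 - q1
# Fences that lie 1.5 box-lengths beyond the relevant quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# Any data point outside the fences can be considered an outlier
outliers = df[(df['age'] < lower_fence) | (df['age'] > upper_fence)]

print(outliers)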

4.3.3 Multiple Groups

Splitting the data into groups can help in seeing differences or similarities between them.

ax = plt.axes()
bp = df.boxplot(column='age', by='previous_donor', grid=False, return_type='dict', ax=ax)
# Iterate over each box
for box in bp:
    plt.setp(box['boxes'], color='k')
    plt.setp(box['medians'], color='k')
    plt.setp(box['whiskers'], color='k')
plt.suptitle('')
ax.set_title('Box-and-Whisker Plot of Continuous Data')
ax.set_ylabel('Age [yrs]')
ax.set_xlabel('Previous Donor')

plt.show()

4.4 Measures of the Average (Central Tendency)

4.4.1 Mean

The mean is based on the values of the data points and is usually more representative when the data is not skewed. Use “\(\mu\)” (the lowercase Greek letter ‘mu’) for the population mean (eg if you are doing inferential statistics) and “\(\bar{x}\)” (pronounced ‘x bar’) for the sample mean (eg if you are doing descriptive statistics):

# Mean of a Pandas series
x_bar = df['age'].mean()

print(f'The mean age is {x_bar:4.1f} years')
## The mean age is 39.2 years

This can also be done with NumPy’s mean() function:

# Mean of an array-like object
x_bar = np.mean(df['age'])

print(f'The mean age is {x_bar:4.1f} years')
## The mean age is 39.2 years

4.4.2 Median

The median is based on the ranks of the data points and is usually more representative when the data is skewed:

# Median of a Pandas series
median = df['times_donated'].median()

print(f'The median number of previous donations is {int(median)}')
## The median number of previous donations is 0

This can also be done with NumPy’s median() function:

# Median of an array-like object
median = np.median(df['times_donated'])

print(f'The median number of previous donations is {int(median)}')
## The median number of previous donations is 0
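
Incidentally, the times_donated data is skewed: over half of the donors have never donated before. Here’s a minimal comparison of the mean and the median that shows why the median is the more representative average in this case:

# The mean gets pulled towards the tail of a skewed distribution
print(df['times_donated'].mean())
## 1.55
# ...whereas the median stays with the bulk of the data
print(df['times_donated'].median())
## 0.0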

4.4.3 Mode

The mode is based on the frequency at which data points appear and is usually more representative when the data is nominal:

# Mode of a Pandas series
mode = df['blood_type'].mode()
# If there are multiple modes they will all be returned. If this is the case,
# let's just take the first
mode = mode[0]

print(f'The most common blood type in this sample is {mode}')
## The most common blood type in this sample is AB−

For the record, in real life, AB− is the LEAST COMMON blood type!

The mode can also be found with SciPy’s mode() function, although this only works with numeric data:

# Mode of an array-like object of numerics
mode, count = st.mode(df['age'])

print(f'The most common age in this sample is {mode}')
## The most common age in this sample is 43

4.5 Measures of Dispersion (Variability)

The dispersion within a set of data points can be described and visualized using:

  • The standard deviation and a dot plot (usually better for symmetric data)
  • The interquartile range and a box plot (usually better for skewed data)

4.5.1 Standard Deviation

The standard deviation is the square root of the variance. The variance is the “mean squared distance of the data points from the mean” - in other words, it’s a measure of how dispersed the data is around the mean, but its units are the data’s units squared. By implication, the units of the standard deviation are exactly the same as those of the data itself. So the variance is meaningful as a measure of variability, but the standard deviation is often more useful because it is directly comparable with the data. An example of this usefulness is the fact that we can make statements like “about 95% of the data is found within two standard deviations either side of the sample mean, assuming a normal distribution” (see the section on Confidence Levels below) - something that is only possible because the units of the standard deviation, the mean and the data itself are all the same.
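
To make the square root relationship concrete, here’s a minimal check that the standard deviation is indeed the square root of the variance (the sample versions of both are used; see below for the meaning of ddof):

# Sample variance of the ages (units: years squared)
variance = df['age'].var(ddof=1)
# Taking the square root recovers the sample standard deviation (units: years)
print(np.isclose(np.sqrt(variance), df['age'].std(ddof=1)))
## True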

Use “\(\sigma\)” (the lowercase Greek letter ‘sigma’) for population standard deviation (eg if you are doing inferential statistics) and “\(s\)” for sample standard deviation (eg if you are doing descriptive statistics). Population standard deviation is calculated using a delta degrees of freedom (“ddof”) value of 0 while sample standard deviation has a ddof value of 1:

# Population standard deviation
σ = df['age'].std(ddof=0)

print(σ)
## 13.280308626768514
# Sample standard deviation
s = df['age'].std(ddof=1)

print(f'The mean age is {x_bar:4.1f} years with a sample standard deviation of {s:4.2f} years')
## The mean age is 39.2 years with a sample standard deviation of 13.34 years

The standard deviation is based on the data points’ values and gets reported with the mean (as has been done above). This is in contrast to the interquartile range which is based on the data points’ ranks and gets reported with the median.

4.5.1.1 Aside: Population vs Sample Standard Deviations: Bessel’s Correction

As mentioned above:

Population standard deviation is calculated using a delta degrees of freedom (“ddof”) value of 0 while sample standard deviation has a ddof value of 1

This is a programmers’ way of saying that the formula for the population standard deviation involves dividing by \(n\) (the sample size) whereas the formula for the sample standard deviation involves dividing by \(n - 1\). The ‘ddof’ value is the amount that gets subtracted from the sample size in the formula that you want Python to use.

Broadly speaking, the degrees of freedom of a set of numbers is the sample size minus the number of pieces of information known about the numbers. For example, if someone asked you to choose 8 numbers then you would have the freedom to choose 8 values of your pleasing. However, if someone asked you to choose 8 numbers that summed to \(x\) then you would only have the freedom to choose 7 values because the 8th value would have to be such that the overall total was \(x\). Similarly, if someone asked you to choose 8 numbers that summed to \(x\) and had a product of \(y\) then you would only be able to choose 6 values freely; the 7th and 8th values would need to be such that the total sum and product were correct.
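
This can be made concrete with a trivial sketch (the numbers below are arbitrary): if 8 numbers must sum to a given total, only 7 of them can be chosen freely:

# Choose a required total and 7 free values
x = 100
free_choices = [12, 3, 25, 7, 19, 8, 11]
# The 8th value is then forced - there is no freedom left in choosing it
eighth = x - sum(free_choices)

print(eighth, sum(free_choices + [eighth]) == x)
## 15 True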

When calculating the sample standard deviation the mean of the values is necessarily calculated first, hence one piece of information about the numbers is created and the number of degrees of freedom decreases by one as a result. Thus the change in the number of degrees of freedom is 1 and, because ‘delta’ is used in maths to mean ‘change’, the delta degrees of freedom (‘ddof’) is 1. When calculating the population standard deviation, however, considerations about the number of degrees of freedom are irrelevant (the ddof is 0): you have all the information that exists about this metric in this population and so you are calculating exact values.

This idea of subtracting 1 from the sample size when calculating sample variance (and hence when calculating sample standard deviation) is known as “Bessel’s correction”. To quote Wikipedia: “it corrects the bias in the estimation of the population variance, and some, but not all of the bias in the estimation of the population standard deviation”. See the pages on Bessel’s correction and on the Unbiased estimation of standard deviation for more information.
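
As a concrete illustration of the two divisors (a minimal sketch; these are just the textbook formulas written out by hand and compared against the Pandas results):

# Squared distances of the data points from the mean
squared_distances = (df['age'] - df['age'].mean())**2
# Population standard deviation: divide by n (ie ddof = 0)
sigma = np.sqrt(squared_distances.sum() / n)
# Sample standard deviation: divide by n - 1 (ie ddof = 1; Bessel's correction)
s = np.sqrt(squared_distances.sum() / (n - 1))

print(np.isclose(sigma, df['age'].std(ddof=0)), np.isclose(s, df['age'].std(ddof=1)))
## True True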

4.5.1.2 Aside: Confidence Levels

A useful rule-of-thumb is that about 95% of normally-distributed data is found within two standard deviations either side of the sample mean (see the Wikipedia page on this “68–95–99.7 rule”). For the sake of interest, here’s how to calculate this:

# Number of standard deviations
z_critical = 2
# Cumulative distribution function of the normal distribution
phi = st.norm.cdf(z_critical)
# Number of tails
tails = 2
# Significance level
alpha = (1 - phi) * tails
# Confidence level
c = (1 - alpha) * 100

print(f'{c}% of normally-distributed data is found within two standard deviations of the mean')
## 95.44997361036415% of normally-distributed data is found within two standard deviations of the mean

The reverse of the above statement is possibly also interesting: “95% of normally-distributed data is found within about 1.96 standard deviations either side of the sample mean”. Again, here’s how to calculate this:

# Confidence level
c = 95  # %
# Significance level
alpha = 1 - (c / 100)
# Number of tails
tails = 2
# Number of standard deviations
z_critical = st.norm.ppf(1 - alpha / tails)

print(f'95% of normally-distributed data is found within {z_critical} standard deviations of the mean')
## 95% of normally-distributed data is found within 1.959963984540054 standard deviations of the mean

4.5.2 Dot Plot

Dot plots are a good way to visualize dispersion (or the lack thereof) when the data is not skewed:

# Create axes
ax = plt.axes()
# Create scatter plot
ax.scatter(df['age'], np.ones(n) + 0.1 * np.random.randn(n), s=4)
# Create straight lines
ax.axhline(1, c='k')
ax.axvline(x_bar - s, 0.4, 0.6, c='k', ls='--')
ax.axvline(x_bar, 0.25, 0.75, c='k', ls='--')
ax.axvline(x_bar + s, 0.4, 0.6, c='k', ls='--')
# Axes' customisation
ax.set_title('Dot Plot of Continuous Data')
ax.yaxis.set_ticklabels([])
ax.yaxis.set_ticks([])
ax.set_ylim([0, 2])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.set_xlabel('Age [yrs]')

plt.show()

4.5.3 Interquartile Range

As mentioned above, quartiles are useful for describing numerical data and deciding if it is skewed. If it is, it might be decided that the interquartile range is a better measure of dispersion and variability than the standard deviation. Remember that ‘quartiles’ are a type of ‘quantile’, hence why the .quantile() method is being used below:

first_quartile = df['age'].quantile(0.25)
second_quartile = df['age'].quantile(0.5)
third_quartile = df['age'].quantile(0.75)
iqr = third_quartile - first_quartile

print(f'The median age is {second_quartile} years with an interquartile range of {iqr} years')
## The median age is 39.0 years with an interquartile range of 21.25 years

In one line of code:

# One-liner
iqr = df['age'].quantile([0.25, 0.75]).diff().iloc[-1]

print(f'The median age is {second_quartile} years with an interquartile range of {iqr} years')
## The median age is 39.0 years with an interquartile range of 21.25 years

…or, using SciPy:

# One-liner
iqr = st.iqr(df['age'])

print(f'The median age is {second_quartile} years with an interquartile range of {iqr} years')
## The median age is 39.0 years with an interquartile range of 21.25 years

The interquartile range is based on the data points’ ranks and gets reported with the median (as has been done above). This is in contrast to the standard deviation which is based on the data points’ values and gets reported with the mean.

4.5.4 Box-and-Whisker Plots

As mentioned above, box-and-whisker plots are a good way to visualize dispersion (or the lack thereof) when the data is skewed:

# Create axes
ax = plt.axes()
# Create box plot
bp = df.boxplot(column='age', grid=False, return_type='dict', vert=False)
# Edit box plot
plt.setp(bp['boxes'], color='k')
plt.setp(bp['medians'], color='k')
plt.setp(bp['whiskers'], color='k')
# Edit axes
ax.set_title('Box Plot')
ax.yaxis.set_ticklabels([])
ax.yaxis.set_ticks([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.set_xlabel('Age [yrs]')
# Create scatter plot
ax.scatter(df['age'], np.ones(n) + 0.03 * np.random.randn(n), s=4)

plt.show()

4.6 A Shortcut

The quickest way to get descriptive statistics for numerical data is to use the .describe() method:

# Describe a series (a column) of numerical data
print(df['age'].describe())
## count    120.000000
## mean      39.241667
## std       13.335992
## min       16.000000
## 25%       28.000000
## 50%       39.000000
## 75%       49.250000
## max       66.000000
## Name: age, dtype: float64
# Describe the numerical data in a data frame
print(df.describe())
##               age  times_donated
## count  120.000000     120.000000
## mean    39.241667       1.550000
## std     13.335992       1.891378
## min     16.000000       0.000000
## 25%     28.000000       0.000000
## 50%     39.000000       0.000000
## 75%     49.250000       3.000000
## max     66.000000       5.000000

5 Inferential Statistics

If you wanted to know something about an entire population - eg the average height of all the adults in a city or the proportion of fish in a lake that are of a particular species - it would be impractical to measure this directly. While it might be theoretically possible to measure everyone’s height or to catch all of the fish, a realistic alternative would be to only look at a few individuals and then make generalizations about the whole from that limited subset. This is done using inferential statistics. These statistics are:

  • Used to infer information about the underlying distribution or population
  • Used to make generalizations about a population from a sample
  • Done on data that is a subset of the entire amount of data that exists, due to it being impractical to collect every piece of data that exists
  • Done after descriptive statistics. In the real world, people usually:
    1. Collect data from a subset (sample) of people in a population
    2. Do descriptive statistics on the data that was collected (the sample)
    3. Do inferential statistics to make generalizations about the population (a small example of this is sketched below)
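
As a taster, here’s a minimal sketch using SciPy’s t-distribution (confidence intervals and hypothesis testing are covered in more detail on their own pages) showing how the descriptive statistics calculated above could be used to infer a 95% confidence interval for the mean age of the wider population of potential donors:

# Sample mean of the ages
x_bar = df['age'].mean()
# Standard error of the mean (SciPy uses ddof=1 by default)
sem = st.sem(df['age'])
# 95% confidence interval for the population mean, via the t-distribution
# with n - 1 degrees of freedom
lower, upper = st.t.interval(0.95, n - 1, loc=x_bar, scale=sem)

print(f'We can be 95% confident that the population mean age is between {lower:.1f} and {upper:.1f} years')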

Topics within inferential statistics include:

  • Probability distributions
  • Hypothesis testing
  • Correlation testing
  • Regression analysis

Many of these have their own pages; go back to find them:

⇦ Back