Grouping and aggregating data is also known as a “split-apply-combine” process: splitting the data into groups is followed by applying a function to each and combining them back into a single data frame. See the Pandas User Guide here.
The code on this page uses the Statsmodels, NumPy and Pandas packages. These can be installed from the terminal with:
$ python3.11 -m pip install statsmodels
$ python3.11 -m pip install numpy
$ python3.11 -m pip install pandas
Replace python3.11
with the version of Python you are using. Once installed, these packages can be imported into your Python script via the following:
from statsmodels import api as sm
import numpy as np
import pandas as pd
For this page we will use the “American National Election Survey 1996” dataset from Statsmodels (see the documentation here):
# Load the data
dataset = sm.datasets.anes96.load_pandas()
The data within this data set has 11 columns although we will only use two: age
and PID
. The PID
column represents which political party each survey respondent identified with but is encoded as a number. We can decode it using the following dictionary:
pid = {
0: 'Strong Democrat',
1: 'Weak Democrat',
2: 'Independent-Democrat',
3: 'Independent-Independent',
4: 'Independent-Republican',
5: 'Weak Republican',
6: 'Strong Republican',
}
So let’s go ahead with the pre-processing of our raw data:
cols = ['PID', 'age']
df = dataset['data'].loc[:, cols]
df['PID'] = df['PID'].replace(pid)
print(df.head(10))
## PID age
## 0 Strong Republican 36.0
## 1 Weak Democrat 20.0
## 2 Weak Democrat 24.0
## 3 Weak Democrat 28.0
## 4 Strong Democrat 68.0
## 5 Weak Democrat 21.0
## 6 Weak Democrat 77.0
## 7 Independent-Republican 21.0
## 8 Independent-Independent 31.0
## 9 Strong Democrat 39.0
The numerical column(s) in the data set can be described using the .describe()
method:
print(df.describe())
## age
## count 944.000000
## mean 47.043432
## std 16.423130
## min 19.000000
## 25% 34.000000
## 50% 44.000000
## 75% 58.000000
## max 91.000000
Using include='all'
will include the categorical column(s):
print(df.describe(include='all'))
## PID age
## count 944 944.000000
## unique 7 NaN
## top Strong Democrat NaN
## freq 200 NaN
## mean NaN 47.043432
## std NaN 16.423130
## min NaN 19.000000
## 25% NaN 34.000000
## 50% NaN 44.000000
## 75% NaN 58.000000
## max NaN 91.000000
These summary statistics can be performed individually using Pandas’s methods, for example .mean()
will calculate the mean:
mean = df['age'].mean()
print(mean)
## 47.043432203389834
This can also be accomplished with NumPy’s mean()
function:
mean = np.mean(df['age'])
print(mean)
## 47.043432203389834
While this is fine for entire columns we often want to calculate the summary statistics for individual groups within the data, eg the mean age for each PID
value. In order to do that we first need to group the data:
We can group by a column of categorical data via the .groupby()
method, but this returns a DataFrameGroupBy object
which is of no use to us:
grouped = df.groupby('PID')
print(grouped)
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdce8b0ef90>
If we index one of the columns, we just get a different object type which is of no use:
grouped = df.groupby('PID')['age']
print(grouped)
## <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fdce989a390>
So the grouping has been made, but we need to aggregate it before we can use it:
Pandas has 21 main built-in aggregation methods. All of them are listed here but the main ones are:
.mean()
and .median()
for central tendency.std()
for the standard deviation.min()
and .max()
for the range.nunique()
for the number of unique values.count()
for the number of non-null values.sum()
for the summeans = df.groupby('PID')['age'].mean()
print(means.head())
## PID
## Independent-Democrat 44.083333
## Independent-Independent 47.324324
## Independent-Republican 48.968085
## Strong Democrat 50.165000
## Strong Republican 48.091429
## Name: age, dtype: float64
…or, because we only have one un-grouped column in this particular data frame, we can leave the column name out:
means = df.groupby('PID').mean()
print(means.head())
## age
## PID
## Independent-Democrat 44.083333
## Independent-Independent 47.324324
## Independent-Republican 48.968085
## Strong Democrat 50.165000
## Strong Republican 48.091429
These values can also be sorted:
means = df.groupby('PID')['age'].mean().sort_values()
print(means.head())
## PID
## Weak Democrat 43.622222
## Independent-Democrat 44.083333
## Weak Republican 46.620000
## Independent-Independent 47.324324
## Strong Republican 48.091429
## Name: age, dtype: float64
This sorting is ascending by default. Sort descending by setting this behaviour to False
via the keyword argument:
means = df.groupby('PID')['age'].mean().sort_values(ascending=False)
print(means.head())
## PID
## Strong Democrat 50.165000
## Independent-Republican 48.968085
## Strong Republican 48.091429
## Independent-Independent 47.324324
## Weak Republican 46.620000
## Name: age, dtype: float64
Resetting the index is often more useful for downstream analysis:
means = df.groupby('PID')['age'].mean().sort_values(ascending=False)
means = means.reset_index()
print(means.head())
## PID age
## 0 Strong Democrat 50.165000
## 1 Independent-Republican 48.968085
## 2 Strong Republican 48.091429
## 3 Independent-Independent 47.324324
## 4 Weak Republican 46.620000
If you want to rename the one column you have aggregated, it’s easier to do this before resetting the index while it’s still the only column in the data frame:
means = df.groupby('PID')['age'].mean().sort_values(ascending=False)
means.columns = ['mean_age']
means = means.reset_index()
print(means.head())
## PID age
## 0 Strong Democrat 50.165000
## 1 Independent-Republican 48.968085
## 2 Strong Republican 48.091429
## 3 Independent-Independent 47.324324
## 4 Weak Republican 46.620000
Using multiple methods one after another on an object is known as chaining.
Let’s look at two columns at once:
cols = ['PID', 'age', 'TVnews']
df = dataset['data'].loc[:, cols]
df['PID'] = df['PID'].replace(pid)
print(df.head(10))
## PID age TVnews
## 0 Strong Republican 36.0 7.0
## 1 Weak Democrat 20.0 1.0
## 2 Weak Democrat 24.0 7.0
## 3 Weak Democrat 28.0 4.0
## 4 Strong Democrat 68.0 7.0
## 5 Weak Democrat 21.0 3.0
## 6 Weak Democrat 77.0 7.0
## 7 Independent-Republican 21.0 1.0
## 8 Independent-Independent 31.0 7.0
## 9 Strong Democrat 39.0 0.0
Now, if we just want to perform one type of aggregation we can simply do the same thing as before:
stats = df.groupby('PID').mean()
print(stats)
## age TVnews
## PID
## Independent-Democrat 44.083333 3.675926
## Independent-Independent 47.324324 3.540541
## Independent-Republican 48.968085 3.776596
## Strong Democrat 50.165000 4.350000
## Strong Republican 48.091429 3.702857
## Weak Democrat 43.622222 3.311111
## Weak Republican 46.620000 3.480000
However, if we want to have more control and perform different aggregation methods to different columns we need to be more specific. This can be done using the .agg()
method:
stats = df.groupby('PID').agg({'age': 'mean', 'TVnews': 'std'})
print(stats)
## age TVnews
## PID
## Independent-Democrat 44.083333 2.681988
## Independent-Independent 47.324324 3.078551
## Independent-Republican 48.968085 2.692699
## Strong Democrat 50.165000 2.651917
## Strong Republican 48.091429 2.794235
## Weak Democrat 43.622222 2.533108
## Weak Republican 46.620000 2.521638
Here we are calculating the mean for one column and the standard deviation for the other. Again, this step can be followed by resetting the index:
stats = df.groupby('PID').agg({'age': 'mean', 'TVnews': 'std'}).reset_index()
print(stats)
## PID age TVnews
## 0 Independent-Democrat 44.083333 2.681988
## 1 Independent-Independent 47.324324 3.078551
## 2 Independent-Republican 48.968085 2.692699
## 3 Strong Democrat 50.165000 2.651917
## 4 Strong Republican 48.091429 2.794235
## 5 Weak Democrat 43.622222 2.533108
## 6 Weak Republican 46.620000 2.521638
We aren’t limited to the built-in aggregation methods provided by Pandas, we can write our own functions and apply them to groups using the .apply()
method:
def binning(column):
"""Separate ages into bins."""
mean_age = column.mean()
if mean_age < 45:
return '<45'
elif mean_age < 50:
return '45 to 50'
else:
return '50+'
# Separate out the average ages of the groups into bins
binned = df.groupby('PID')['age'].apply(binning)
print(binned)
## PID
## Independent-Democrat <45
## Independent-Independent 45 to 50
## Independent-Republican 45 to 50
## Strong Democrat 50+
## Strong Republican 45 to 50
## Weak Democrat <45
## Weak Republican 45 to 50
## Name: age, dtype: object
These custom functions can also take arguments as per normal:
def over_45(column, normalise=False):
"""Count the number of people over 45."""
mask = column < 45
if normalise:
return sum(mask) / len(mask)
else:
return sum(mask)
# Get the proportion of people over 45
binned = df.groupby('PID')['age'].apply(over_45, normalise=True)
print(binned)
## PID
## Independent-Democrat 0.620370
## Independent-Independent 0.459459
## Independent-Republican 0.425532
## Strong Democrat 0.445000
## Strong Republican 0.480000
## Weak Democrat 0.555556
## Weak Republican 0.566667
## Name: age, dtype: float64
This example is an optional argument, so we don’t actually need to use it:
# Get the number of people over 45
binned = df.groupby('PID')['age'].apply(over_45)
print(binned)
## PID
## Independent-Democrat 67
## Independent-Independent 17
## Independent-Republican 40
## Strong Democrat 89
## Strong Republican 84
## Weak Democrat 100
## Weak Republican 85
## Name: age, dtype: int64
Note that the built-in Pandas methods will always be fast compared to custom functions. This is because they take advantage of something known as vectorisation which is a particularly efficient for computation. Custom functions, on the other hand, effectively loop over all the elements in a column which is much slower.
Another use of the .apply()
method to apply custom functions is to define lambda functions. These functions use a dummy variable (usually “x”) to represent the values in the column, and whatever you use “x” for then gets used on every value in the column. For example, here’s how to use NumPy’s percentile()
function to get the 95th percentile of peoples’ age for each group:
# Group the data by PID
grouped = df.groupby('PID')['age']
# Get the 95th percentile of age for each group
grouped = grouped.apply(lambda x: np.percentile(x, 95)).reset_index()
# Sort the values
grouped = grouped.sort_values('age')
print(grouped)
## PID age
## 5 Weak Democrat 69.05
## 1 Independent-Independent 74.00
## 6 Weak Republican 75.00
## 0 Independent-Democrat 76.00
## 2 Independent-Republican 76.00
## 3 Strong Democrat 78.00
## 4 Strong Republican 81.00
Finding things like the mean and median of a numeric column makes sense because those can be calculated, but what would it mean to do it for a categorical column? Categorical values - such as peoples political leanings - usually do not have an order. However, in certain situations it makes sense to give them one, eg to define a political spectrum from left-wing to right-wing. This can be done in Python with the Categorical
function from Pandas:
order = [
'Strong Democrat', 'Weak Democrat', 'Independent-Democrat',
'Independent-Independent', 'Independent-Republican', 'Weak Republican',
'Strong Republican',
]
df['PID'] = pd.Categorical(df['PID'], order, ordered=True)
Now that the PID
column has ordinal values (ordered categorical values) the concept of finding, for example, a median makes sense:
idx = np.median(df['PID'].cat.codes)
median_category = order[int(idx)]
print(median_category)
## Independent-Democrat
This is the ‘middle’ political leaning of the people that were surveyed.