The code on this page uses the Statsmodels, Pandas, Seaborn and Matplotlib packages. These can be installed from the terminal with:
$ python3.11 -m pip install statsmodels
$ python3.11 -m pip install pandas
$ python3.11 -m pip install seaborn
$ python3.11 -m pip install matplotlib
Replace python3.11
with the version of Python you are using. Once installed, these packages can be imported into your Python script via the following:
from statsmodels import api as sm
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
For this page we will use the “American National Election Survey 1996” dataset from Statsmodels (see the documentation here):
# Load the data
dataset = sm.datasets.anes96.load_pandas()
The data within this data set has 11 columns but we will only use three of them:
PID
: The political party with which the survey respondent identifiededuc
: Education levelselfLR
: Self-reported political leanings from “Left” to “Right” on a scale of 1 to 7All three of these columns are numerically encoded: they contain numbers which represent their categorical meaning. We will decode the PID
and educ
columns as follows:
pid = {
0: 'Strong Democrat',
1: 'Weak Democrat',
2: 'Indep-Democrat',
3: 'Indep-Indep',
4: 'Indep-Republican',
5: 'Weak Republican',
6: 'Strong Republican',
}
educ = {
1: '1-8 grades',
2: 'Some high school',
3: 'High school',
4: 'Some college',
5: 'College degree',
6: "Master's",
7: 'PhD',
}
cols = ['PID', 'educ', 'selfLR']
df = dataset['data'].loc[:, cols]
df['PID'] = df['PID'].replace(pid)
df['educ'] = df['educ'].replace(educ)
print(df.head(10))
## PID educ selfLR
## 0 Strong Republican High school 7.0
## 1 Weak Democrat Some college 3.0
## 2 Weak Democrat Master's 2.0
## 3 Weak Democrat Master's 3.0
## 4 Strong Democrat Master's 5.0
## 5 Weak Democrat Some college 3.0
## 6 Weak Democrat Some college 5.0
## 7 Indep-Republican Some college 5.0
## 8 Indep-Indep Some college 4.0
## 9 Strong Democrat High school 3.0
A pivot table takes the groups from within a data frame and re-organises the data according to those groups. Pandas has two methods that will do this: .pivot()
and .pivot_table()
. The difference is that the first needs the data frame to already be grouped while the second can do the grouping, aggregating and pivoting in one step.
.pivot()
The fact that this method requires a data frame that is already grouped is not a problem because we can use the .groupby()
method for this (see this page for more):
.groupby()
outputs data in long format.pivot()
outputs data in wide formatLet’s have a go with getting the mean value for respondents’ political leanings, grouped by political party identification and education level:
grouped = df.groupby(['PID', 'educ'])['selfLR'].mean()
grouped = grouped.reset_index()
print(grouped.head())
## PID educ selfLR
## 0 Indep-Democrat 1-8 grades 4.000000
## 1 Indep-Democrat College degree 3.384615
## 2 Indep-Democrat High school 3.892857
## 3 Indep-Democrat Master's 3.555556
## 4 Indep-Democrat PhD 3.200000
And, now that the data is grouped, we can pivot it:
pivot_table = grouped.pivot(index='PID', columns='educ', values='selfLR')
print(pivot_table)
## educ 1-8 grades College degree High school Master's PhD Some college Some high school
## PID
## Indep-Democrat 4.0 3.384615 3.892857 3.555556 3.200000 4.266667 4.250000
## Indep-Indep NaN 4.333333 3.833333 4.333333 4.000000 3.888889 4.666667
## Indep-Republican 6.0 4.875000 5.000000 5.000000 4.812500 4.937500 4.857143
## Strong Democrat 4.2 3.117647 3.762712 2.925000 2.545455 3.000000 4.421053
## Strong Republican 6.0 5.882353 5.642857 5.735849 5.680000 5.515152 6.250000
## Weak Democrat 4.0 3.647059 4.102041 3.195122 2.782609 3.916667 3.600000
## Weak Republican NaN 4.933333 5.000000 5.000000 4.823529 5.175000 5.400000
This data is perfect for plotting in a heat map, although before we do that it would be clever to re-order the columns into their logical order. The easiest way to do that in this example is just to start again, so here we go:
# Load the data
dataset = sm.datasets.anes96.load_pandas()
# Select the columns
cols = ['PID', 'educ', 'selfLR']
df = dataset['data'].loc[:, cols]
# Group the data
grouped = df.groupby(['PID', 'educ'])['selfLR'].mean()
grouped = grouped.reset_index()
# Pivot
pivot_table = grouped.pivot(index='PID', columns='educ', values='selfLR')
pivot_table = pivot_table.rename(index=pid)
pivot_table = pivot_table.rename(columns=educ)
# Plot
label = 'Self-Reported Political Leanings'
sns.heatmap(pivot_table, annot=True, cbar_kws={'label': label})
plt.ylabel('Political Party Identified With')
plt.xlabel('Education Level')
plt.xticks(rotation=60)
plt.tight_layout()
plt.show()
.pivot_table()
Instead of chaining .groupby()
, .mean()
, .reset_index()
and .pivot()
we can do it all with .pivot_table()
as follows:
# Load the data
dataset = sm.datasets.anes96.load_pandas()
# Select the columns
cols = ['PID', 'educ', 'selfLR']
df = dataset['data'].loc[:, cols]
# Decode the values
pid = {
0: 'Strong Democrat',
1: 'Weak Democrat',
2: 'Indep-Democrat',
3: 'Indep-Indep',
4: 'Indep-Republican',
5: 'Weak Republican',
6: 'Strong Republican',
}
educ = {
1: '1-8 grades',
2: 'Some high school',
3: 'High school',
4: 'Some college',
5: 'College degree',
6: "Master's",
7: 'PhD',
}
df['PID'] = df['PID'].replace(pid)
df['educ'] = df['educ'].replace(educ)
# Create a pivot table in one step
pivot_table = df.pivot_table(
index='PID', columns='educ', values='selfLR', aggfunc='mean'
)
print(pivot_table)
## educ 1-8 grades College degree High school Master's PhD Some college Some high school
## PID
## Indep-Democrat 4.0 3.384615 3.892857 3.555556 3.200000 4.266667 4.250000
## Indep-Indep NaN 4.333333 3.833333 4.333333 4.000000 3.888889 4.666667
## Indep-Republican 6.0 4.875000 5.000000 5.000000 4.812500 4.937500 4.857143
## Strong Democrat 4.2 3.117647 3.762712 2.925000 2.545455 3.000000 4.421053
## Strong Republican 6.0 5.882353 5.642857 5.735849 5.680000 5.515152 6.250000
## Weak Democrat 4.0 3.647059 4.102041 3.195122 2.782609 3.916667 3.600000
## Weak Republican NaN 4.933333 5.000000 5.000000 4.823529 5.175000 5.400000
Note that df.pivot_table()
is the same as pd.pivot_table(df)
:
pivot_table = pd.pivot_table(
df, index='PID', columns='educ', values='selfLR', aggfunc='mean'
)
print(pivot_table)
## educ 1-8 grades College degree High school Master's PhD Some college Some high school
## PID
## Indep-Democrat 4.0 3.384615 3.892857 3.555556 3.200000 4.266667 4.250000
## Indep-Indep NaN 4.333333 3.833333 4.333333 4.000000 3.888889 4.666667
## Indep-Republican 6.0 4.875000 5.000000 5.000000 4.812500 4.937500 4.857143
## Strong Democrat 4.2 3.117647 3.762712 2.925000 2.545455 3.000000 4.421053
## Strong Republican 6.0 5.882353 5.642857 5.735849 5.680000 5.515152 6.250000
## Weak Democrat 4.0 3.647059 4.102041 3.195122 2.782609 3.916667 3.600000
## Weak Republican NaN 4.933333 5.000000 5.000000 4.823529 5.175000 5.400000
Using .reset_index()
is an optional extra step that will return the data frame to the more recognisable format with numbers as the indices:
pivot_table = pivot_table.reset_index()
print(pivot_table)
## educ PID 1-8 grades College degree High school Master's PhD Some college Some high school
## 0 Indep-Democrat 4.0 3.384615 3.892857 3.555556 3.200000 4.266667 4.250000
## 1 Indep-Indep NaN 4.333333 3.833333 4.333333 4.000000 3.888889 4.666667
## 2 Indep-Republican 6.0 4.875000 5.000000 5.000000 4.812500 4.937500 4.857143
## 3 Strong Democrat 4.2 3.117647 3.762712 2.925000 2.545455 3.000000 4.421053
## 4 Strong Republican 6.0 5.882353 5.642857 5.735849 5.680000 5.515152 6.250000
## 5 Weak Democrat 4.0 3.647059 4.102041 3.195122 2.782609 3.916667 3.600000
## 6 Weak Republican NaN 4.933333 5.000000 5.000000 4.823529 5.175000 5.400000