⇦ Back

1 Packages

The code on this page uses the Statsmodels, Pandas, Seaborn and Matplotlib packages. These can be installed from the terminal with:

$ python3.11 -m pip install statsmodels
$ python3.11 -m pip install pandas
$ python3.11 -m pip install seaborn
$ python3.11 -m pip install matplotlib

Replace python3.11 with the version of Python you are using. Once installed, these packages can be imported into your Python script via the following:

from statsmodels import api as sm
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

2 Example Data

For this page we will use the “American National Election Survey 1996” dataset from Statsmodels (see the documentation here):

# Load the data
dataset = sm.datasets.anes96.load_pandas()

The data within this data set has 11 columns but we will only use three of them:

  • PID : The political party with which the survey respondent identified
  • educ : Education level
  • selfLR : Self-reported political leanings from “Left” to “Right” on a scale of 1 to 7

All three of these columns are numerically encoded: they contain numbers which represent their categorical meaning. We will decode the PID and educ columns as follows:

pid = {
    0: 'Strong Democrat',
    1: 'Weak Democrat',
    2: 'Indep-Democrat',
    3: 'Indep-Indep',
    4: 'Indep-Republican',
    5: 'Weak Republican',
    6: 'Strong Republican',
}
educ = {
    1: '1-8 grades',
    2: 'Some high school',
    3: 'High school',
    4: 'Some college',
    5: 'College degree',
    6: "Master's",
    7: 'PhD',
}
cols = ['PID', 'educ', 'selfLR']
df = dataset['data'].loc[:, cols]
df['PID'] = df['PID'].replace(pid)
df['educ'] = df['educ'].replace(educ)

print(df.head(10))
##                  PID          educ  selfLR
## 0  Strong Republican   High school     7.0
## 1      Weak Democrat  Some college     3.0
## 2      Weak Democrat      Master's     2.0
## 3      Weak Democrat      Master's     3.0
## 4    Strong Democrat      Master's     5.0
## 5      Weak Democrat  Some college     3.0
## 6      Weak Democrat  Some college     5.0
## 7   Indep-Republican  Some college     5.0
## 8        Indep-Indep  Some college     4.0
## 9    Strong Democrat   High school     3.0

3 Pivot Tables

A pivot table takes the groups from within a data frame and re-organises the data according to those groups. Pandas has two methods that will do this: .pivot() and .pivot_table(). The difference is that the first needs the data frame to already be grouped while the second can do the grouping, aggregating and pivoting in one step.

3.1 .pivot()

The fact that this method requires a data frame that is already grouped is not a problem because we can use the .groupby() method for this (see this page for more):

  • .groupby() outputs data in long format
  • .pivot() outputs data in wide format

Let’s have a go with getting the mean value for respondents’ political leanings, grouped by political party identification and education level:

grouped = df.groupby(['PID', 'educ'])['selfLR'].mean()
grouped = grouped.reset_index()

print(grouped.head())
##               PID            educ    selfLR
## 0  Indep-Democrat      1-8 grades  4.000000
## 1  Indep-Democrat  College degree  3.384615
## 2  Indep-Democrat     High school  3.892857
## 3  Indep-Democrat        Master's  3.555556
## 4  Indep-Democrat             PhD  3.200000

And, now that the data is grouped, we can pivot it:

pivot_table = grouped.pivot(index='PID', columns='educ', values='selfLR')

print(pivot_table)
## educ               1-8 grades  College degree  High school  Master's       PhD  Some college  Some high school
## PID                                                                                                           
## Indep-Democrat            4.0        3.384615     3.892857  3.555556  3.200000      4.266667          4.250000
## Indep-Indep               NaN        4.333333     3.833333  4.333333  4.000000      3.888889          4.666667
## Indep-Republican          6.0        4.875000     5.000000  5.000000  4.812500      4.937500          4.857143
## Strong Democrat           4.2        3.117647     3.762712  2.925000  2.545455      3.000000          4.421053
## Strong Republican         6.0        5.882353     5.642857  5.735849  5.680000      5.515152          6.250000
## Weak Democrat             4.0        3.647059     4.102041  3.195122  2.782609      3.916667          3.600000
## Weak Republican           NaN        4.933333     5.000000  5.000000  4.823529      5.175000          5.400000

3.2 Heat Map

This data is perfect for plotting in a heat map, although before we do that it would be clever to re-order the columns into their logical order. The easiest way to do that in this example is just to start again, so here we go:

# Load the data
dataset = sm.datasets.anes96.load_pandas()
# Select the columns
cols = ['PID', 'educ', 'selfLR']
df = dataset['data'].loc[:, cols]
# Group the data
grouped = df.groupby(['PID', 'educ'])['selfLR'].mean()
grouped = grouped.reset_index()
# Pivot
pivot_table = grouped.pivot(index='PID', columns='educ', values='selfLR')
pivot_table = pivot_table.rename(index=pid)
pivot_table = pivot_table.rename(columns=educ)
# Plot
label = 'Self-Reported Political Leanings'
sns.heatmap(pivot_table, annot=True, cbar_kws={'label': label})
plt.ylabel('Political Party Identified With')
plt.xlabel('Education Level')
plt.xticks(rotation=60)
plt.tight_layout()
plt.show()

3.3 .pivot_table()

Instead of chaining .groupby(), .mean(), .reset_index() and .pivot() we can do it all with .pivot_table() as follows:

# Load the data
dataset = sm.datasets.anes96.load_pandas()
# Select the columns
cols = ['PID', 'educ', 'selfLR']
df = dataset['data'].loc[:, cols]
# Decode the values
pid = {
    0: 'Strong Democrat',
    1: 'Weak Democrat',
    2: 'Indep-Democrat',
    3: 'Indep-Indep',
    4: 'Indep-Republican',
    5: 'Weak Republican',
    6: 'Strong Republican',
}
educ = {
    1: '1-8 grades',
    2: 'Some high school',
    3: 'High school',
    4: 'Some college',
    5: 'College degree',
    6: "Master's",
    7: 'PhD',
}
df['PID'] = df['PID'].replace(pid)
df['educ'] = df['educ'].replace(educ)

# Create a pivot table in one step
pivot_table = df.pivot_table(
    index='PID', columns='educ', values='selfLR', aggfunc='mean'
)

print(pivot_table)
## educ               1-8 grades  College degree  High school  Master's       PhD  Some college  Some high school
## PID                                                                                                           
## Indep-Democrat            4.0        3.384615     3.892857  3.555556  3.200000      4.266667          4.250000
## Indep-Indep               NaN        4.333333     3.833333  4.333333  4.000000      3.888889          4.666667
## Indep-Republican          6.0        4.875000     5.000000  5.000000  4.812500      4.937500          4.857143
## Strong Democrat           4.2        3.117647     3.762712  2.925000  2.545455      3.000000          4.421053
## Strong Republican         6.0        5.882353     5.642857  5.735849  5.680000      5.515152          6.250000
## Weak Democrat             4.0        3.647059     4.102041  3.195122  2.782609      3.916667          3.600000
## Weak Republican           NaN        4.933333     5.000000  5.000000  4.823529      5.175000          5.400000

Note that df.pivot_table() is the same as pd.pivot_table(df):

pivot_table = pd.pivot_table(
    df, index='PID', columns='educ', values='selfLR', aggfunc='mean'
)

print(pivot_table)
## educ               1-8 grades  College degree  High school  Master's       PhD  Some college  Some high school
## PID                                                                                                           
## Indep-Democrat            4.0        3.384615     3.892857  3.555556  3.200000      4.266667          4.250000
## Indep-Indep               NaN        4.333333     3.833333  4.333333  4.000000      3.888889          4.666667
## Indep-Republican          6.0        4.875000     5.000000  5.000000  4.812500      4.937500          4.857143
## Strong Democrat           4.2        3.117647     3.762712  2.925000  2.545455      3.000000          4.421053
## Strong Republican         6.0        5.882353     5.642857  5.735849  5.680000      5.515152          6.250000
## Weak Democrat             4.0        3.647059     4.102041  3.195122  2.782609      3.916667          3.600000
## Weak Republican           NaN        4.933333     5.000000  5.000000  4.823529      5.175000          5.400000

Using .reset_index() is an optional extra step that will return the data frame to the more recognisable format with numbers as the indices:

pivot_table = pivot_table.reset_index()

print(pivot_table)
## educ                PID  1-8 grades  College degree  High school  Master's       PhD  Some college  Some high school
## 0        Indep-Democrat         4.0        3.384615     3.892857  3.555556  3.200000      4.266667          4.250000
## 1           Indep-Indep         NaN        4.333333     3.833333  4.333333  4.000000      3.888889          4.666667
## 2      Indep-Republican         6.0        4.875000     5.000000  5.000000  4.812500      4.937500          4.857143
## 3       Strong Democrat         4.2        3.117647     3.762712  2.925000  2.545455      3.000000          4.421053
## 4     Strong Republican         6.0        5.882353     5.642857  5.735849  5.680000      5.515152          6.250000
## 5         Weak Democrat         4.0        3.647059     4.102041  3.195122  2.782609      3.916667          3.600000
## 6       Weak Republican         NaN        4.933333     5.000000  5.000000  4.823529      5.175000          5.400000

⇦ Back