⇦ Back

This page is the fourth in a series on data frames:

  1. Introduction to Data Frames
  2. Searching, Finding & Filtering Data Frames
  3. Calculations with Data Frames
  4. Multi-Level Data Frames

1 Single-Level vs Multi-Level Data Frames

A ‘normal’ data frame has rows and columns, meaning that every cell can be uniquely identified by its row name and column name. As an example, take the results of the men’s sprint final from the Rio 2016 Olympics:

import pandas as pd

pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 40)
pd.set_option('display.width', 200)

index = ['Gold', 'Silver', 'Bronze', '4th', '5th', '6th', '7th', '8th']
rio_men = {
    '100m': [
        'Usain Bolt', 'Justin Gatlin', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine', 'Ben Youssef Meïté',
        'Jimmy Vicaut', 'Trayvon Bromell'
    ],
    '200m': [
        'Usain Bolt', 'Andre De Grasse', 'Christophe Lemaitre', 'Adam Gemili', 'Churandy Martina',
        'LaShawn Merritt', 'Alonso Edward', 'Ramil Guliyev'
    ],
}
df = pd.DataFrame(rio_men, index=index)
print(df)
##                      100m                 200m
## Gold           Usain Bolt           Usain Bolt
## Silver      Justin Gatlin      Andre De Grasse
## Bronze    Andre De Grasse  Christophe Lemaitre
## 4th           Yohan Blake          Adam Gemili
## 5th         Akani Simbine     Churandy Martina
## 6th     Ben Youssef Meïté      LaShawn Merritt
## 7th          Jimmy Vicaut        Alonso Edward
## 8th       Trayvon Bromell        Ramil Guliyev

From the above data frame it’s clear that the person who came 4th in the 200m was Adam Gemili; specifying one row name and one column name identifies one cell:

print(df.loc['4th', '200m'])
## Adam Gemili

However, it’s also possible to have a multi-level index (or MultiIndex) object as the row or column names. This creates a hierarchy of names:

columns = [
    ('100m', 'Women'), ('100m', 'Men'), ('200m', 'Women'), ('200m', 'Men')
]
columns = pd.MultiIndex.from_tuples(columns)
index = ['Gold', 'Silver', 'Bronze', '4th', '5th', '6th', '7th', '8th']
data = [
    ['Elaine Thompson', 'Usain Bolt', 'Elaine Thompson', 'Usain Bolt'],
    ['Tori Bowie', 'Justin Gatlin', 'Dafne Schippers', 'Andre De Grasse'],
    ['Shelly-Ann Fraser-Pryce', 'Andre De Grasse', 'Tori Bowie', 'Christophe Lemaitre'],
    ['Marie-Josee Ta Lou', 'Yohan Blake', 'Marie Josée Ta Lou', 'Adam Gemili'],
    ['Dafne Schippers', 'Akani Simbine', 'Dina Asher-Smith', 'Churandy Martina'],
    ['Michelle-Lee Ahye', 'Ben Youssef Meïté', 'Michelle-Lee Ahye', 'LaShawn Merritt'],
    ['English Gardner', 'Jimmy Vicaut', 'Deajah Stevens', 'Alonso Edward'],
    ['Christania Williams', 'Trayvon Bromell', 'Ivet Lalova-Collio', 'Ramil Guliyev']
]
df = pd.DataFrame(data=data, columns=columns, index=index)
print(df)
##                            100m                                   200m                     
##                           Women                Men               Women                  Men
## Gold            Elaine Thompson         Usain Bolt     Elaine Thompson           Usain Bolt
## Silver               Tori Bowie      Justin Gatlin     Dafne Schippers      Andre De Grasse
## Bronze  Shelly-Ann Fraser-Pryce    Andre De Grasse          Tori Bowie  Christophe Lemaitre
## 4th          Marie-Josee Ta Lou        Yohan Blake  Marie Josée Ta Lou          Adam Gemili
## 5th             Dafne Schippers      Akani Simbine    Dina Asher-Smith     Churandy Martina
## 6th           Michelle-Lee Ahye  Ben Youssef Meïté   Michelle-Lee Ahye      LaShawn Merritt
## 7th             English Gardner       Jimmy Vicaut      Deajah Stevens        Alonso Edward
## 8th         Christania Williams    Trayvon Bromell  Ivet Lalova-Collio        Ramil Guliyev

Now, in order to extract the Adam Gemili cell we need to provide three pieces of information: 4th, 200m and Men. The latter two are the two components of this cell’s multi-level column name and can be specified using a tuple - two comma-separated pieces of information enclosed with round brackets:

print(df.loc['4th', ('200m', 'Men')])
## Adam Gemili

2 Re-Structuring Multi-Level Data Frames

The .stack() method takes the column names (or the inner-most column names if there are multiple levels) and ‘stacks’ them on top of the row names to create multi-level row names:

df = df.stack()
print(df)
##                                  100m                 200m
## Gold   Men                 Usain Bolt           Usain Bolt
##        Women          Elaine Thompson      Elaine Thompson
## Silver Men              Justin Gatlin      Andre De Grasse
##        Women               Tori Bowie      Dafne Schippers
## Bronze Men            Andre De Grasse  Christophe Lemaitre
##        Women  Shelly-Ann Fraser-Pryce           Tori Bowie
## 4th    Men                Yohan Blake          Adam Gemili
##        Women       Marie-Josee Ta Lou   Marie Josée Ta Lou
## 5th    Men              Akani Simbine     Churandy Martina
##        Women          Dafne Schippers     Dina Asher-Smith
## 6th    Men          Ben Youssef Meïté      LaShawn Merritt
##        Women        Michelle-Lee Ahye    Michelle-Lee Ahye
## 7th    Men               Jimmy Vicaut        Alonso Edward
##        Women          English Gardner       Deajah Stevens
## 8th    Men            Trayvon Bromell        Ramil Guliyev
##        Women      Christania Williams   Ivet Lalova-Collio

In this example, the inner-most level of the column names (“Women” and “Men”) got moved to become the inner-most level of the row names. It is not the row that needs to be specified using a tuple:

print(df.loc[('4th', 'Men'), '200m'])
## Adam Gemili

2.1 Searching for Values

This re-structuring can be used to search for values. Let’s start with a single-level data frame:

index = ['Gold', 'Silver', 'Bronze', '4th', '5th', '6th', '7th', '8th']
rio_men = {
    '100m': [
        'Usain Bolt', 'Justin Gatlin', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine', 'Ben Youssef Meïté',
        'Jimmy Vicaut', 'Trayvon Bromell'
    ],
    '200m': [
        'Usain Bolt', 'Andre De Grasse', 'Christophe Lemaitre', 'Adam Gemili', 'Churandy Martina',
        'LaShawn Merritt', 'Alonso Edward', 'Ramil Guliyev'
    ],
    '400m': [
        'Wayde van Niekerk', 'Kirani James', 'LaShawn Merritt', 'Machel Cedenio', 'Karabo Sibanda',
        'Ali Khamis', 'Bralon Taplin', 'Matthew Hudson-Smith'
    ]
}
df = pd.DataFrame(rio_men, index=index)
print(df)
##                      100m                 200m                  400m
## Gold           Usain Bolt           Usain Bolt     Wayde van Niekerk
## Silver      Justin Gatlin      Andre De Grasse          Kirani James
## Bronze    Andre De Grasse  Christophe Lemaitre       LaShawn Merritt
## 4th           Yohan Blake          Adam Gemili        Machel Cedenio
## 5th         Akani Simbine     Churandy Martina        Karabo Sibanda
## 6th     Ben Youssef Meïté      LaShawn Merritt            Ali Khamis
## 7th          Jimmy Vicaut        Alonso Edward         Bralon Taplin
## 8th       Trayvon Bromell        Ramil Guliyev  Matthew Hudson-Smith

What happens when we stack a single-level data frame? Let’s find out:

print(df.stack())
## Gold    100m              Usain Bolt
##         200m              Usain Bolt
##         400m       Wayde van Niekerk
## Silver  100m           Justin Gatlin
##         200m         Andre De Grasse
##         400m            Kirani James
## Bronze  100m         Andre De Grasse
##         200m     Christophe Lemaitre
##         400m         LaShawn Merritt
## 4th     100m             Yohan Blake
##         200m             Adam Gemili
##         400m          Machel Cedenio
## 5th     100m           Akani Simbine
##         200m        Churandy Martina
##         400m          Karabo Sibanda
## 6th     100m       Ben Youssef Meïté
##         200m         LaShawn Merritt
##         400m              Ali Khamis
## 7th     100m            Jimmy Vicaut
##         200m           Alonso Edward
##         400m           Bralon Taplin
## 8th     100m         Trayvon Bromell
##         200m           Ramil Guliyev
##         400m    Matthew Hudson-Smith
## dtype: object

The column names have been stacked on top of the row names (which you might have expected) but note that the result is a series, not a data frame! This is important because we can now search the entire data frame as if it was one column (which it now is):

search_result = df[df.isin(['Adam Gemili'])].stack()
row = search_result.index[0][0]
col = search_result.index[0][1]
print(f'Adam Gemili came {row} in the {col}')
## Adam Gemili came 4th in the 200m

2.2 Unstacking

Perhaps unsurprisingly, the reverse of the .stack() method is the .unstack() method:

print(df.stack().unstack())
##                      100m                 200m                  400m
## Gold           Usain Bolt           Usain Bolt     Wayde van Niekerk
## Silver      Justin Gatlin      Andre De Grasse          Kirani James
## Bronze    Andre De Grasse  Christophe Lemaitre       LaShawn Merritt
## 4th           Yohan Blake          Adam Gemili        Machel Cedenio
## 5th         Akani Simbine     Churandy Martina        Karabo Sibanda
## 6th     Ben Youssef Meïté      LaShawn Merritt            Ali Khamis
## 7th          Jimmy Vicaut        Alonso Edward         Bralon Taplin
## 8th       Trayvon Bromell        Ramil Guliyev  Matthew Hudson-Smith

⇦ Back