This page is the fourth in a series on data frames:
- Introduction to Data Frames
- Searching, Finding & Filtering Data Frames
- Calculations with Data Frames
- Multi-Level Data Frames
A ‘normal’ data frame has rows and columns, meaning that every cell can be uniquely identified by its row name and column name. As an example, take the results of the men’s sprint final from the Rio 2016 Olympics:
import pandas as pd
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 40)
pd.set_option('display.width', 200)
index = ['Gold', 'Silver', 'Bronze', '4th', '5th', '6th', '7th', '8th']
rio_men = {
'100m': [
'Usain Bolt', 'Justin Gatlin', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine', 'Ben Youssef Meïté',
'Jimmy Vicaut', 'Trayvon Bromell'
],
'200m': [
'Usain Bolt', 'Andre De Grasse', 'Christophe Lemaitre', 'Adam Gemili', 'Churandy Martina',
'LaShawn Merritt', 'Alonso Edward', 'Ramil Guliyev'
],
}
df = pd.DataFrame(rio_men, index=index)
print(df)
## 100m 200m
## Gold Usain Bolt Usain Bolt
## Silver Justin Gatlin Andre De Grasse
## Bronze Andre De Grasse Christophe Lemaitre
## 4th Yohan Blake Adam Gemili
## 5th Akani Simbine Churandy Martina
## 6th Ben Youssef Meïté LaShawn Merritt
## 7th Jimmy Vicaut Alonso Edward
## 8th Trayvon Bromell Ramil Guliyev
From the above data frame it’s clear that the person who came 4th in the 200m was Adam Gemili; specifying one row name and one column name identifies one cell:
print(df.loc['4th', '200m'])
## Adam Gemili
However, it’s also possible to have a multi-level index (or MultiIndex) object as the row or column names. This creates a hierarchy of names:
columns = [
('100m', 'Women'), ('100m', 'Men'), ('200m', 'Women'), ('200m', 'Men')
]
columns = pd.MultiIndex.from_tuples(columns)
index = ['Gold', 'Silver', 'Bronze', '4th', '5th', '6th', '7th', '8th']
data = [
['Elaine Thompson', 'Usain Bolt', 'Elaine Thompson', 'Usain Bolt'],
['Tori Bowie', 'Justin Gatlin', 'Dafne Schippers', 'Andre De Grasse'],
['Shelly-Ann Fraser-Pryce', 'Andre De Grasse', 'Tori Bowie', 'Christophe Lemaitre'],
['Marie-Josee Ta Lou', 'Yohan Blake', 'Marie Josée Ta Lou', 'Adam Gemili'],
['Dafne Schippers', 'Akani Simbine', 'Dina Asher-Smith', 'Churandy Martina'],
['Michelle-Lee Ahye', 'Ben Youssef Meïté', 'Michelle-Lee Ahye', 'LaShawn Merritt'],
['English Gardner', 'Jimmy Vicaut', 'Deajah Stevens', 'Alonso Edward'],
['Christania Williams', 'Trayvon Bromell', 'Ivet Lalova-Collio', 'Ramil Guliyev']
]
df = pd.DataFrame(data=data, columns=columns, index=index)
print(df)
## 100m 200m
## Women Men Women Men
## Gold Elaine Thompson Usain Bolt Elaine Thompson Usain Bolt
## Silver Tori Bowie Justin Gatlin Dafne Schippers Andre De Grasse
## Bronze Shelly-Ann Fraser-Pryce Andre De Grasse Tori Bowie Christophe Lemaitre
## 4th Marie-Josee Ta Lou Yohan Blake Marie Josée Ta Lou Adam Gemili
## 5th Dafne Schippers Akani Simbine Dina Asher-Smith Churandy Martina
## 6th Michelle-Lee Ahye Ben Youssef Meïté Michelle-Lee Ahye LaShawn Merritt
## 7th English Gardner Jimmy Vicaut Deajah Stevens Alonso Edward
## 8th Christania Williams Trayvon Bromell Ivet Lalova-Collio Ramil Guliyev
Now, in order to extract the Adam Gemili cell we need to provide three pieces of information: 4th, 200m and Men. The latter two are the two components of this cell’s multi-level column name and can be specified using a tuple - two comma-separated pieces of information enclosed with round brackets:
print(df.loc['4th', ('200m', 'Men')])
## Adam Gemili
The .stack()
method takes the column names (or the inner-most column names if there are multiple levels) and ‘stacks’ them on top of the row names to create multi-level row names:
df = df.stack()
print(df)
## 100m 200m
## Gold Men Usain Bolt Usain Bolt
## Women Elaine Thompson Elaine Thompson
## Silver Men Justin Gatlin Andre De Grasse
## Women Tori Bowie Dafne Schippers
## Bronze Men Andre De Grasse Christophe Lemaitre
## Women Shelly-Ann Fraser-Pryce Tori Bowie
## 4th Men Yohan Blake Adam Gemili
## Women Marie-Josee Ta Lou Marie Josée Ta Lou
## 5th Men Akani Simbine Churandy Martina
## Women Dafne Schippers Dina Asher-Smith
## 6th Men Ben Youssef Meïté LaShawn Merritt
## Women Michelle-Lee Ahye Michelle-Lee Ahye
## 7th Men Jimmy Vicaut Alonso Edward
## Women English Gardner Deajah Stevens
## 8th Men Trayvon Bromell Ramil Guliyev
## Women Christania Williams Ivet Lalova-Collio
In this example, the inner-most level of the column names (“Women” and “Men”) got moved to become the inner-most level of the row names. It is not the row that needs to be specified using a tuple:
print(df.loc[('4th', 'Men'), '200m'])
## Adam Gemili
This re-structuring can be used to search for values. Let’s start with a single-level data frame:
index = ['Gold', 'Silver', 'Bronze', '4th', '5th', '6th', '7th', '8th']
rio_men = {
'100m': [
'Usain Bolt', 'Justin Gatlin', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine', 'Ben Youssef Meïté',
'Jimmy Vicaut', 'Trayvon Bromell'
],
'200m': [
'Usain Bolt', 'Andre De Grasse', 'Christophe Lemaitre', 'Adam Gemili', 'Churandy Martina',
'LaShawn Merritt', 'Alonso Edward', 'Ramil Guliyev'
],
'400m': [
'Wayde van Niekerk', 'Kirani James', 'LaShawn Merritt', 'Machel Cedenio', 'Karabo Sibanda',
'Ali Khamis', 'Bralon Taplin', 'Matthew Hudson-Smith'
]
}
df = pd.DataFrame(rio_men, index=index)
print(df)
## 100m 200m 400m
## Gold Usain Bolt Usain Bolt Wayde van Niekerk
## Silver Justin Gatlin Andre De Grasse Kirani James
## Bronze Andre De Grasse Christophe Lemaitre LaShawn Merritt
## 4th Yohan Blake Adam Gemili Machel Cedenio
## 5th Akani Simbine Churandy Martina Karabo Sibanda
## 6th Ben Youssef Meïté LaShawn Merritt Ali Khamis
## 7th Jimmy Vicaut Alonso Edward Bralon Taplin
## 8th Trayvon Bromell Ramil Guliyev Matthew Hudson-Smith
What happens when we stack a single-level data frame? Let’s find out:
print(df.stack())
## Gold 100m Usain Bolt
## 200m Usain Bolt
## 400m Wayde van Niekerk
## Silver 100m Justin Gatlin
## 200m Andre De Grasse
## 400m Kirani James
## Bronze 100m Andre De Grasse
## 200m Christophe Lemaitre
## 400m LaShawn Merritt
## 4th 100m Yohan Blake
## 200m Adam Gemili
## 400m Machel Cedenio
## 5th 100m Akani Simbine
## 200m Churandy Martina
## 400m Karabo Sibanda
## 6th 100m Ben Youssef Meïté
## 200m LaShawn Merritt
## 400m Ali Khamis
## 7th 100m Jimmy Vicaut
## 200m Alonso Edward
## 400m Bralon Taplin
## 8th 100m Trayvon Bromell
## 200m Ramil Guliyev
## 400m Matthew Hudson-Smith
## dtype: object
The column names have been stacked on top of the row names (which you might have expected) but note that the result is a series, not a data frame! This is important because we can now search the entire data frame as if it was one column (which it now is):
search_result = df[df.isin(['Adam Gemili'])].stack()
row = search_result.index[0][0]
col = search_result.index[0][1]
print(f'Adam Gemili came {row} in the {col}')
## Adam Gemili came 4th in the 200m
Perhaps unsurprisingly, the reverse of the .stack()
method is the .unstack()
method:
print(df.stack().unstack())
## 100m 200m 400m
## Gold Usain Bolt Usain Bolt Wayde van Niekerk
## Silver Justin Gatlin Andre De Grasse Kirani James
## Bronze Andre De Grasse Christophe Lemaitre LaShawn Merritt
## 4th Yohan Blake Adam Gemili Machel Cedenio
## 5th Akani Simbine Churandy Martina Karabo Sibanda
## 6th Ben Youssef Meïté LaShawn Merritt Ali Khamis
## 7th Jimmy Vicaut Alonso Edward Bralon Taplin
## 8th Trayvon Bromell Ramil Guliyev Matthew Hudson-Smith