This page is the first part of a series on data frames:
- Introduction to Data Frames
- Searching, Finding & Filtering Data Frames
- Calculations with Data Frames
- Multi-Level Data Frames
A data frame is a table. It is the programming equivalent of an Excel spreadsheet. It has rows and columns with row headings (known as indexes/indices) and column headings.
Data frames are not one of the object types that get included in Python by default. In order to unlock them, you need to install the ‘Pandas’ package and then import it into each script where you want to use them:
pip3 install pandas
import pandas as pd
pd.
“Pandas” is short for “panel data” which is a term for the type of information you get when you take multiple measurements over a number of time points.
Create from dictionary:
A data frame can be ‘manually’ created by using the DataFrame()
function from Pandas. This takes as its input a dictionary object. It turns this into a data frame using the dictionary’s keys as the column headings and the dictionary’s values as the rows. Note that all the columns need to be the same length. In other words, each of the dictionary’s values (which will usually be lists) need to have the same number of elements.
For this page we will use the results of the men’s 100m finals from the Rio 2016 Olympics as our example data:
import pandas as pd
rio_100m_men = {
'Athlete': [
'Usain Bolt', 'Justin Gatlin', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine', 'Ben Youssef Meïté',
'Jimmy Vicaut', 'Trayvon Bromell'
],
'Country': [
'Jamaica', 'United States', 'Canada', 'Jamaica', 'South Africa', 'Ivory Coast', 'France', 'United States'
],
'Age': [29, 34, 21, 26, 22, 29, 24, 21],
'Heat': [10.07, 10.01, 10.04, 10.11, 10.14, 10.03, 10.19, 10.13],
'Semi-Final': [9.86, 9.94, 9.92, 10.01, 9.98, 9.97, 9.95, 10.01],
'Final': [9.81, 9.89, 9.91, 9.93, 9.94, 9.96, 10.04, 10.06],
}
df = pd.DataFrame(rio_100m_men)
print(df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
As you can see, we started with a dictionary rio_100m_men
which had the same number of elements (8) in each value. This was then converted into a Pandas data frame and each key became a column heading and each value became a column corresponding to its key. Of course, the fact that Python is a ‘zero-indexed’ language means that Usain Bolt is shown in the above table as having come ‘0th’! That doesn’t make much sense so let’s re-create the data frame using a custom index (custom row headings):
df = pd.DataFrame({
'Athlete': [
'Usain Bolt', 'Justin Gatlin', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine',
'Ben Youssef Meïté', 'Jimmy Vicaut', 'Trayvon Bromell'
],
'Country': [
'Jamaica', 'United States', 'Canada', 'Jamaica', 'South Africa', 'Ivory Coast', 'France', 'United States'
],
'Age': [29, 34, 21, 26, 22, 29, 24, 21],
'Heat': [10.07, 10.01, 10.04, 10.11, 10.14, 10.03, 10.19, 10.13],
'Semi-Final': [9.86, 9.94, 9.92, 10.01, 9.98, 9.97, 9.95, 10.01],
'Final': [9.81, 9.89, 9.91, 9.93, 9.94, 9.96, 10.04, 10.06],
}, index=[1, 2, 3, 4, 5, 6, 7, 8]
)
print(df)
## Athlete Country Age Heat Semi-Final Final
## 1 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 2 Justin Gatlin United States 34 10.01 9.94 9.89
## 3 Andre De Grasse Canada 21 10.04 9.92 9.91
## 4 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 5 Akani Simbine South Africa 22 10.14 9.98 9.94
## 6 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 7 Jimmy Vicaut France 24 10.19 9.95 10.04
## 8 Trayvon Bromell United States 21 10.13 10.01 10.06
Create from array or lists:
Alternatively, a data frame can be created from a Numpy array and/or a list-of-lists instead of from a dictionary:
import numpy as np
# Create an array from a list-of-lists
rio_100m_men = np.array([
['Athlete', 'Country', 'Age', 'Heat', 'Semi-Final', 'Final'],
['Usain Bolt', 'Jamaica', 29, 10.07, 9.86, 9.81],
['Justin Gatlin', 'United States', 34, 10.01, 9.94, 9.89],
['Andre De Grasse', 'Canada', 21, 10.04, 9.92, 9.91],
['Yohan Blake', 'Jamaica', 26, 10.11, 10.01, 9.93],
['Akani Simbine', 'South Africa', 22, 10.14, 9.98, 9.94],
['Ben Youssef Meïté', 'Ivory Coast', 29, 10.03, 9.97, 9.96],
['Jimmy Vicaut', 'France', 24, 10.19, 9.95, 10.04],
['Trayvon Bromell', 'United States', 21, 10.13, 10.01, 10.06],
])
# Convert the array to a data frame
df = pd.DataFrame(
data=rio_100m_men[1:, :],
columns=rio_100m_men[0, :]
)
print(df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
Import from file:
See the page on file IO for more info on importing data frames.
# Import comma-separated values (.csv file) as a data frame
df = pd.read_csv('data frame.csv')
Import from URL:
Any data object on the internet that could be interpreted as a data frame can be imported:
# Import comma-separated values (.csv file) as a data frame
data_url = 'https://raw.githubusercontent.com/rowannicholls/rowannicholls.github.io/master/python/advanced/data%20frame.csv'
df = pd.read_csv(data_url)
Use the .head()
method to see only the first 5 rows of a data frame (or .head(n)
to see the first n rows). This is useful if you have a lot of data!
print(df.head())
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
Similarly, .tail()
will show the last 5 rows (and .tail(n)
will show the last n rows):
print(df.tail())
## Athlete Country Age Heat Semi-Final Final
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
Change the amount of information that gets shown in your console when you print a data frame by tweaking the below options:
pd.set_option('display.max_rows', 3)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 40)
pd.set_option('display.width', 200)
pd.set_option('display.precision', 3)
print(df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## .. ... ... ... ... ... ...
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
##
## [8 rows x 6 columns]
Reset these options with the following
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.max_colwidth')
pd.reset_option('display.precision')
pd.reset_option('display.width')
print(df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
For more info about the display options, see here: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html
Get the names of the columns:
# Get the column names as an 'Index object'
colnames = df.columns
# Get the column names as a list
colnames = df.columns.to_list()
# or
colnames = list(df.columns)
# or
colnames = list(df)
print(colnames)
## ['Athlete', 'Country', 'Age', 'Heat', 'Semi-Final', 'Final']
Notice from the above that when you convert a data frame into a list you are only left with the column names.
The ‘row names’ are more properly called the ‘indices’ (or ‘indexes’ if you’re American):
# Get the row names as an 'Index object'
rownames = df.index
# Get the row names as a list
rownames = df.index.to_list()
# or
rownames = list(df.index)
print(rownames)
## [0, 1, 2, 3, 4, 5, 6, 7]
Get the number of columns:
ncols = df.shape[1]
# or
ncols = len(list(df))
print(ncols)
## 6
Get the number of rows:
nrows = df.shape[0]
# or
nrows = len(df)
print(nrows)
## 8
Display the data types of the values in each column:
print(df.dtypes)
## Athlete object
## Country object
## Age int64
## Heat float64
## Semi-Final float64
## Final float64
## dtype: object
Display a summary of the information in a data frame:
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 8 entries, 0 to 7
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Athlete 8 non-null object
## 1 Country 8 non-null object
## 2 Age 8 non-null int64
## 3 Heat 8 non-null float64
## 4 Semi-Final 8 non-null float64
## 5 Final 8 non-null float64
## dtypes: float64(3), int64(1), object(2)
## memory usage: 512.0+ bytes
## None
Display summary statistics for each column (use include='all'
to see stats for categorical columns in addition to numeric ones):
print(df.describe(include='all'))
## Athlete Country Age Heat Semi-Final Final
## count 8 8 8.000000 8.000000 8.000000 8.000000
## unique 8 6 NaN NaN NaN NaN
## top Usain Bolt Jamaica NaN NaN NaN NaN
## freq 1 2 NaN NaN NaN NaN
## mean NaN NaN 25.750000 10.090000 9.955000 9.942500
## std NaN NaN 4.652188 0.062564 0.049857 0.080312
## min NaN NaN 21.000000 10.010000 9.860000 9.810000
## 25% NaN NaN 21.750000 10.037500 9.935000 9.905000
## 50% NaN NaN 25.000000 10.090000 9.960000 9.935000
## 75% NaN NaN 29.000000 10.132500 9.987500 9.980000
## max NaN NaN 34.000000 10.190000 10.010000 10.060000
Get the count of values in a column:
print(df['Country'].value_counts())
## Country
## Jamaica 2
## United States 2
## Canada 1
## South Africa 1
## Ivory Coast 1
## France 1
## Name: count, dtype: int64
Get a list of the unique values in a column (ie show the countries that were represented in the 100m final, without repeats):
unique = pd.unique(df['Country'])
print(unique)
## ['Jamaica' 'United States' 'Canada' 'South Africa' 'Ivory Coast' 'France']
Check if a data frame is empty or not:
# Create an empty data frame
df_is_empty = pd.DataFrame({'A': []}).empty
print(df_is_empty)
## True
Append a data frame as a new row by concatenating it to the bottom of the existing data frame via the concat()
function. This is the recommended way of doing it as this method preserves the order of the columns:
new_row_dct = {
'Athlete': 'Chijindu Ujah',
'Country': 'Great Britain',
'Age': 22,
'Heat': 10.13,
'Semi-Final': 10.01,
'Final': None,
}
new_row_df = pd.DataFrame(new_row_dct, index=[1])
df_extended = pd.concat([df, new_row_df], ignore_index=True)
print(df_extended)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
## 8 Chijindu Ujah Great Britain 22 10.13 10.01 None
To append a series as a new row you need to convert it into a data frame with .to_frame()
, then transpose it with .T
and also ignore the index:
new_row_dct = {
'Athlete': 'Chijindu Ujah',
'Country': 'Great Britain',
'Age': 22,
'Heat': 10.13,
'Semi-Final': 10.01,
'Final': None,
}
new_row_ser = pd.Series(new_row_dct)
df_extended = pd.concat([df, new_row_ser.to_frame().T], ignore_index=True)
print(df_extended)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
## 8 Chijindu Ujah Great Britain 22 10.13 10.01 None
If a dictionary’s values are lists like in the example below, you can still convert it into a data frame and append it as normal:
new_row_dct = {
'Athlete': ['Chijindu Ujah'],
'Country': ['Great Britain'],
'Age': [22],
'Heat': [10.13],
'Semi-Final': [10.01],
'Final': [None],
}
new_row_df = pd.DataFrame(new_row_dct)
df_extended = pd.concat([df, new_row_df], ignore_index=True)
print(df_extended)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
## 8 Chijindu Ujah Great Britain 22 10.13 10.01 None
Append a list as a new row:
df.loc[df.shape[0]] = ['Chijindu Ujah', 'Great Britain', 22, 10.13, 10.01, None]
print(df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
## 8 Chijindu Ujah Great Britain 22 10.13 10.01 NaN
The code above works best if the data frame has the ‘default’ row indices, ie if the rows are numbered sequentially from 0. If the data frame is instead a subset of a larger data frame (and thus doesn’t have sequential row indices) or if the indices are strings (see the “rename the rows” section), this code might still work but not produce the expected result:
df_indexed = df.set_index('Athlete')
df_indexed.loc[df_indexed.shape[0]] = [
'Great Britain', 22, 10.13, 10.01, None
]
# This probably isn't what you want
print(df_indexed)
## Country Age Heat Semi-Final Final
## Athlete
## Usain Bolt Jamaica 29 10.07 9.86 9.81
## Justin Gatlin United States 34 10.01 9.94 9.89
## Andre De Grasse Canada 21 10.04 9.92 9.91
## Yohan Blake Jamaica 26 10.11 10.01 9.93
## Akani Simbine South Africa 22 10.14 9.98 9.94
## Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## Jimmy Vicaut France 24 10.19 9.95 10.04
## Trayvon Bromell United States 21 10.13 10.01 10.06
## 8 Great Britain 22 10.13 10.01 NaN
Whereas ‘appending’ a row will add it onto the bottom of a data frame, the code below ‘inserts’ it into whatever position you specify (in this example it is index 5):
new_row = ['Chijindu Ujah', 'Great Britain', 22, 10.13, 10.01, None]
new_row = np.array([new_row])
new_row = pd.DataFrame(new_row, columns=list(df))
insert_at = 5
objs = [df.iloc[:insert_at, :], new_row, df.iloc[insert_at:, :]]
df_new = pd.concat(objs, ignore_index=True)
print(df_new)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Chijindu Ujah Great Britain 22 10.13 10.01 NaN
## 6 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 7 Jimmy Vicaut France 24 10.19 9.95 10.04
## 8 Trayvon Bromell United States 21 10.13 10.01 10.06
Similarly, the above code works best with the default indices (0, 1, 2, etc).
Instead of adding one row to the bottom of a data frame, now we are adding a whole data frame onto the bottom:
new_row_df = pd.DataFrame({
'Athlete': ['Jak Ali Harvey', 'Nickel Ashmeade', 'Christophe Lemaitre'],
'Country': ['Turkey', 'Jamaica', 'France'],
'Age': [27, 26, 26],
'Heat': [10.14, 10.13, 10.16],
'Semi-Final': [10.03, 10.05, 10.07],
'Final': [None, None, None],
})
new_df = pd.concat([df, new_row_df], ignore_index=True)
print(new_df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
## 8 Jak Ali Harvey Turkey 27 10.14 10.03 None
## 9 Nickel Ashmeade Jamaica 26 10.13 10.05 None
## 10 Christophe Lemaitre France 26 10.16 10.07 None
‘Merging’ is a more complicated form of combining data frames than ‘appending’ or ‘concatenating’. The two objects get combined such that where they have matching values the data gets added together in the same row as opposed to in a new row. It’s complicated enough that I’ve made a whole separate page on merging.
You can rename rows (as many as you want at one time) using a dictionary together with the rename()
function:
# Format is {old_name: new_name}
df = df.rename(index={1: 'Row 1'})
print(df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## Row 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
# Format is {old_name: new_name}
df = df.rename(
index={
2: 'Row 2',
3: 'Row 3',
4: 'Row 4',
}
)
print(df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## Row 1 Justin Gatlin United States 34 10.01 9.94 9.89
## Row 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## Row 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## Row 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
Reset the row names (row indices) but keep the old ones in a column called “index”:
df_reset = df.reset_index()
print(df_reset)
## index Athlete Country Age Heat Semi-Final Final
## 0 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Row 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Row 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Row 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Row 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 7 Trayvon Bromell United States 21 10.13 10.01 10.06
Reset the row names completely:
df = df.reset_index(drop = True)
print(df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
Use a column as the row names (row indices):
Up until now we have been using the row indices that were auto-generated when the data frame was created: the rows were numbered from 0 at the top to 7 at the bottom. However, we don’t have to use those numbers and, in fact, we can use words at the row indices if we want. This is how we can use the athletes’ names as the row indices (the .set_index()
function is needed):
df_indexed = df.set_index('Athlete')
print(df_indexed)
## Country Age Heat Semi-Final Final
## Athlete
## Usain Bolt Jamaica 29 10.07 9.86 9.81
## Justin Gatlin United States 34 10.01 9.94 9.89
## Andre De Grasse Canada 21 10.04 9.92 9.91
## Yohan Blake Jamaica 26 10.11 10.01 9.93
## Akani Simbine South Africa 22 10.14 9.98 9.94
## Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## Jimmy Vicaut France 24 10.19 9.95 10.04
## Trayvon Bromell United States 21 10.13 10.01 10.06
This can make it easier to index the data frame as we can now use row ‘names’ instead of row ‘indices’:
# Get certain rows and certain columns as a data frame
subset = df_indexed.loc[['Justin Gatlin', 'Jimmy Vicaut'], ['Country', 'Age']]
print(subset)
## Country Age
## Athlete
## Justin Gatlin United States 34
## Jimmy Vicaut France 24
# Get one row and all columns as a series
cols = df_indexed.loc['Jimmy Vicaut', :]
print(cols)
## Country France
## Age 24
## Heat 10.19
## Semi-Final 9.95
## Final 10.04
## Name: Jimmy Vicaut, dtype: object
# Get one row and one column as a value
cell = df_indexed.loc['Jimmy Vicaut', 'Age']
print(cell)
## 24
Delete row using row name:
df_deleted = df.drop(1)
print(df_deleted)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
Delete row using row index (the below example looks exactly the same as the above example because the row names and row indices of our data frame happen to be the same, but this will not necessarily always be the case!):
df_deleted = df.drop(df.index[1])
print(df_deleted)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
Delete multiple rows by name:
df_deleted = df.drop([1, 2, 5])
print(df_deleted)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
Delete rows that have a certain value in a particular column (in this example: delete all rows where the athlete is 30 years old):
df.drop(df[df['Age'] == 30].index)
Delete rows that have one of a list of certain values in a particular column (in this example: delete all rows where the athlete is 30, 31 or 32 years old):
df.drop(df[df['Age'].isin([30, 31, 32])].index)
To modify an existing data frame by adding columns:
df['Reaction Time'] = [0.155, 0.152, 0.141, 0.145, 0.128, 0.156, 0.140, 0.135]
df['Finish Position'] = ['Gold', 'Silver', 'Bronze', '4th', '5th', '6th', '7th', '8th']
print(df)
## Athlete Country Age Heat Semi-Final Final Reaction Time Finish Position
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81 0.155 Gold
## 1 Justin Gatlin United States 34 10.01 9.94 9.89 0.152 Silver
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91 0.141 Bronze
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93 0.145 4th
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94 0.128 5th
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96 0.156 6th
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04 0.140 7th
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06 0.135 8th
To create a new data frame that has a copy of the original data plus the new column(s), use the .assign()
method:
df_new = df.assign(Other=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])
print(df_new)
## Athlete Country Age Heat Semi-Final Final Reaction Time Finish Position Other
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81 0.155 Gold A
## 1 Justin Gatlin United States 34 10.01 9.94 9.89 0.152 Silver B
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91 0.141 Bronze C
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93 0.145 4th D
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94 0.128 5th E
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96 0.156 6th F
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04 0.140 7th G
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06 0.135 8th H
The .assign()
method is slower (because it has to create a copy of the data) but it allows you to keep the old, unmodified data frame in addition to having the new, modified one.
Appending a column inside a loop:
The incorrect way:
new_df = df['Athlete'] # This creates a SERIES, not a DATA FRAME
for col in ['Semi-Final', 'Final']:
new_df[col] = df[col] # This does not work as expected
The correct way (although it might generate a SettingWithCopy warning):
new_df = df.loc[:, ['Athlete']] # This creates a DATA FRAME
for col in ['Semi-Final', 'Final']:
new_df[col] = df[col] # This works as expected
print(new_df)
## Athlete Semi-Final Final
## 0 Usain Bolt 9.86 9.81
## 1 Justin Gatlin 9.94 9.89
## 2 Andre De Grasse 9.92 9.91
## 3 Yohan Blake 10.01 9.93
## 4 Akani Simbine 9.98 9.94
## 5 Ben Youssef Meïté 9.97 9.96
## 6 Jimmy Vicaut 9.95 10.04
## 7 Trayvon Bromell 10.01 10.06
The correct way (which will not generate a SettingWithCopy warning):
new_df = df.loc[:, ['Athlete']]
for col in ['Semi-Final', 'Final']:
new_df[col] = df[col].copy()
print(new_df)
## Athlete Semi-Final Final
## 0 Usain Bolt 9.86 9.81
## 1 Justin Gatlin 9.94 9.89
## 2 Andre De Grasse 9.92 9.91
## 3 Yohan Blake 10.01 9.93
## 4 Akani Simbine 9.98 9.94
## 5 Ben Youssef Meïté 9.97 9.96
## 6 Jimmy Vicaut 9.95 10.04
## 7 Trayvon Bromell 10.01 10.06
‘Appending’ a column will place it on the end, whereas ‘inserting’ a column will place it at the location specified (in this case it will be at position 1).
new_col = [
'Jamaica', 'United States', 'Canada', 'Jamaica', 'South Africa', 'Ivory Coast', 'France', 'United States'
]
new_df.insert(1, 'Country', new_col)
print(new_df)
## Athlete Country Semi-Final Final
## 0 Usain Bolt Jamaica 9.86 9.81
## 1 Justin Gatlin United States 9.94 9.89
## 2 Andre De Grasse Canada 9.92 9.91
## 3 Yohan Blake Jamaica 10.01 9.93
## 4 Akani Simbine South Africa 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 9.97 9.96
## 6 Jimmy Vicaut France 9.95 10.04
## 7 Trayvon Bromell United States 10.01 10.06
Instead of adding one column to the side of a data frame, now we are adding a whole data frame onto the side:
new_col_df = pd.DataFrame({
'Reaction Time': [0.155, 0.152, 0.141, 0.145, 0.128, 0.156, 0.140, 0.135],
'Finish Position': ['Gold', 'Silver', 'Bronze', '4th', '5th', '6th', '7th', '8th'],
})
new_df = pd.concat([new_df, new_col_df], axis=1)
print(new_df)
## Athlete Country Semi-Final Final Reaction Time Finish Position
## 0 Usain Bolt Jamaica 9.86 9.81 0.155 Gold
## 1 Justin Gatlin United States 9.94 9.89 0.152 Silver
## 2 Andre De Grasse Canada 9.92 9.91 0.141 Bronze
## 3 Yohan Blake Jamaica 10.01 9.93 0.145 4th
## 4 Akani Simbine South Africa 9.98 9.94 0.128 5th
## 5 Ben Youssef Meïté Ivory Coast 9.97 9.96 0.156 6th
## 6 Jimmy Vicaut France 9.95 10.04 0.140 7th
## 7 Trayvon Bromell United States 10.01 10.06 0.135 8th
‘Merging’ is a more complicated form of combining data frames than ‘appending’ or ‘concatenating’. The two objects get combined such that where they have matching values the data gets added together in the same row as opposed to in a new row. It’s complicated enough that I’ve made a whole page on merging.
Use the agg()
method to aggregate two (or more) columns of text into one with a custom substring as a conjunction:
# Join multiple columns of text
ser = df[['Finish Position', 'Athlete']].agg(': '.join, axis=1)
print(ser)
## 0 Gold: Usain Bolt
## 1 Silver: Justin Gatlin
## 2 Bronze: Andre De Grasse
## 3 4th: Yohan Blake
## 4 5th: Akani Simbine
## 5 6th: Ben Youssef Meïté
## 6 7th: Jimmy Vicaut
## 7 8th: Trayvon Bromell
## dtype: object
Rename one column as follows:
# Format is {'old_name': 'new_name'}
df_renamed = df.rename(columns={'Athlete': 'Name'})
print(list(df_renamed))
## ['Name', 'Country', 'Age', 'Heat', 'Semi-Final', 'Final', 'Reaction Time', 'Finish Position']
Rename multiple columns as follows:
# Format is {'old_name': 'new_name'}
df_renamed = df.rename(
columns={
'Heat': 'Race 1',
'Semi-Final': 'Race 2',
'Final': 'Race 3',
}
)
print(list(df_renamed))
## ['Athlete', 'Country', 'Age', 'Race 1', 'Race 2', 'Race 3', 'Reaction Time', 'Finish Position']
Alternatively, you can use a dictionary comprehension to rename all columns at once:
new_names = ['Name', 'Nation', 'Years', 'Race 1', 'Race 2', 'Race 3', 'Position', 'Reaction']
dct = {old: new for old, new in zip(list(df), new_names)}
df_renamed = df.rename(columns=dct)
print(list(df_renamed))
## ['Name', 'Nation', 'Years', 'Race 1', 'Race 2', 'Race 3', 'Position', 'Reaction']
Delete column by column name:
df = df.drop('Reaction Time', axis=1)
print(df)
## Athlete Country Age Heat Semi-Final Final Finish Position
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81 Gold
## 1 Justin Gatlin United States 34 10.01 9.94 9.89 Silver
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91 Bronze
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93 4th
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94 5th
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96 6th
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04 7th
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06 8th
Delete column(s) by index:
df = df.drop(df.columns[[6]], axis=1)
print(df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
Iterating over a data frame’s rows using .iterrows()
for i, row in df.iterrows():
row_num = i
person = row['Athlete']
age = row['Age']
print(i, person, age)
## 0 Usain Bolt 29
## 1 Justin Gatlin 34
## 2 Andre De Grasse 21
## 3 Yohan Blake 26
## 4 Akani Simbine 22
## 5 Ben Youssef Meïté 29
## 6 Jimmy Vicaut 24
## 7 Trayvon Bromell 21
Iterating over a data frame’s rows using .itertuples()
for row in df.itertuples():
row_num = row.Index
person = row.Athlete
age = row.Age
print(row_num, person, age)
## 0 Usain Bolt 29
## 1 Justin Gatlin 34
## 2 Andre De Grasse 21
## 3 Yohan Blake 26
## 4 Akani Simbine 22
## 5 Ben Youssef Meïté 29
## 6 Jimmy Vicaut 24
## 7 Trayvon Bromell 21
Iterate over a data frame’s columns using .items()
for colname, col in df.items():
Gatlin = col[1]
Simbine = col[4]
print(colname, Gatlin, Simbine)
## Athlete Justin Gatlin Akani Simbine
## Country United States South Africa
## Age 34 22
## Heat 10.01 10.14
## Semi-Final 9.94 9.98
## Final 9.89 9.94
Iterate over a data frame’s columns using the column names
for colname in list(df):
Bolt = df[colname][1]
print(Bolt)
## Justin Gatlin
## United States
## 34
## 10.01
## 9.94
## 9.89
Iterate over every cell in a data frame by using BOTH .iterrows()
and .items()
# Use a smaller data frame for this example
dct = {
'Athlete': ['Usain Bolt', 'Justin Gatlin', 'Andre De Grasse'],
'Country': ['Jamaica', 'United States', 'Canada'],
'Age': [29, 34, 21],
'Heat': [10.07, 10.01, 10.04],
'Semi-Final': [9.86, 9.94, 9.92],
'Final': [9.81, 9.89, 9.91],
}
df_small = pd.DataFrame(dct)
# Iterate over every cell
for i, row in df_small.iterrows():
for colname, cell in row.items():
print(f'{colname:10} {cell}')
## Athlete Usain Bolt
## Country Jamaica
## Age 29
## Heat 10.07
## Semi-Final 9.86
## Final 9.81
## Athlete Justin Gatlin
## Country United States
## Age 34
## Heat 10.01
## Semi-Final 9.94
## Final 9.89
## Athlete Andre De Grasse
## Country Canada
## Age 21
## Heat 10.04
## Semi-Final 9.92
## Final 9.91
Iterate over every cell in a single row by reducing the data frame to a single row then iterating through the columns:
# Iterate over every cell in row 3 (index number 2)
idx = 2
for colname, col in df[idx:idx + 1].items():
cell = col.values[0]
print(cell)
## Andre De Grasse
## Canada
## 21
## 10.04
## 9.92
## 9.91
‘Indexing’ is the method used to select or extract data in a data frame.
There are two methods that can be used to get certain rows and columns out of a data frame:
.loc[]
uses the names of the columns you want.iloc[]
uses the indices of the columns you wantRemember that row names are the same as row indices, so both methods use these.
Columns names should be provided as lists (see the ‘extract one cell’ section to see what happens if you don’t) while column indices and row indices can be provided as either lists or slices. The advantage of using slices is that you can quickly provide many numbers; the disadvantage is that these numbers have to be consecutive.
Here’s a demonstration of .loc[]
and .iloc[]
at work using lists:
# Format is [[row_idxs], [col_names]]
subset = df.loc[[2, 4], ['Athlete', 'Final']]
print(subset)
## Athlete Final
## 2 Andre De Grasse 9.91
## 4 Akani Simbine 9.94
# Format is [[row_idxs], [col_idxs]]
subset = df.iloc[[2, 4], [0, 5]]
print(subset)
## Athlete Final
## 2 Andre De Grasse 9.91
## 4 Akani Simbine 9.94
Extract multiple consecutive rows and columns:
As mentioned, using slices instead of lists when specifying indices can be more efficient but only works when extracting consecutive rows and columns:
# Format is [slice, [col_names]] to get consecutive rows
subset = df.loc[2:4, ['Athlete', 'Country', 'Age']]
print(subset)
## Athlete Country Age
## 2 Andre De Grasse Canada 21
## 3 Yohan Blake Jamaica 26
## 4 Akani Simbine South Africa 22
# Format is [slice, slice] to get consecutive rows and cols
subset = df.iloc[2:5, :3]
print(subset)
## Athlete Country Age
## 2 Andre De Grasse Canada 21
## 3 Yohan Blake Jamaica 26
## 4 Akani Simbine South Africa 22
Also note that in the above examples .loc[]
took 2:4
as its slice of row indices while .iloc[]
took 2:5
and yet they returned the same rows! The methods handle slices differently!
Similar to the above, .loc[]
and .iloc[]
can be used to get multiple rows (and all columns). This is a rare occasion where both methods are used in the exact same way!
# Format is [[row_idxs], :] to get multiple rows and all columns
rows = df.loc[[1, 3], :]
print(rows)
## Athlete Country Age Heat Semi-Final Final
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
# Format is [[row_idxs], :] to get multiple rows and all columns
rows = df.iloc[[1, 3], :]
print(rows)
## Athlete Country Age Heat Semi-Final Final
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
Extract multiple consecutive rows:
Use slices instead of lists for the row indices/names:
# Format is [slice, :] to get consecutive rows and all columns
rows = df.loc[2:3, :]
print(rows)
## Athlete Country Age Heat Semi-Final Final
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
# Format is [slice, :] to get consecutive rows and all columns
rows = df.iloc[2:4, :]
print(rows)
## Athlete Country Age Heat Semi-Final Final
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
Once again note that .loc[]
and .iloc[]
handle slices differently (2:3
and 2:4
produced the same result!).
Bonus: you can index a data frame directly (ie without using a function or method) with a slice to get consecutive rows:
# Format is [slice] to get consecutive rows
rows = df[slice(2, 4)]
print(rows)
## Athlete Country Age Heat Semi-Final Final
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
rows = df[2:4]
print(rows)
## Athlete Country Age Heat Semi-Final Final
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
Extracting just one row uses the same process as extracting multiple rows, except this time only one row is specified. Any of the three aforementioned methods can be used (.loc[]
, .iloc[]
and direct indexing).
To extract the row as a data frame (ie as a spreadsheet with only one row), use a list (ie have square brackets around the row index) or slice:
# Format is [[number], :] to get one row and all columns as a data frame
row = df.loc[[5], :]
print(row)
## Athlete Country Age Heat Semi-Final Final
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
# Format is [[number], :] to get one row and all columns as a data frame
row = df.iloc[[5], :]
print(row)
## Athlete Country Age Heat Semi-Final Final
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
# Format is [slice(x, x + 1)] to get one row and all columns as a data frame
row = df[slice(5, 6)]
print(row)
## Athlete Country Age Heat Semi-Final Final
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
# Format is [x:x + 1] to get one row and all columns as a data frame
row = df[5:6]
print(row)
## Athlete Country Age Heat Semi-Final Final
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
To extract the row as a series (ie as a column with only one row), use a value (ie do not have square brackets around the row index):
# Format is [number, :] to get one row and all columns as a series
row = df.loc[5, :]
print(row)
## Athlete Ben Youssef Meïté
## Country Ivory Coast
## Age 29
## Heat 10.03
## Semi-Final 9.97
## Final 9.96
## Name: 5, dtype: object
# Format is [number, :] to get one row and all columns as a series
row = df.iloc[5, :]
print(row)
## Athlete Ben Youssef Meïté
## Country Ivory Coast
## Age 29
## Heat 10.03
## Semi-Final 9.97
## Final 9.96
## Name: 5, dtype: object
# Format is [number] to get one row and all columns as a series
row = df.loc[5]
print(row)
## Athlete Ben Youssef Meïté
## Country Ivory Coast
## Age 29
## Heat 10.03
## Semi-Final 9.97
## Final 9.96
## Name: 5, dtype: object
Extract one row and only certain columns:
If you use any of the above three methods to extract a single row as a series you can then use direct indexing to extract only certain columns:
# Format is [number, :][[col_names]] to get one row and certain columns as a series
row = df.loc[5, :][['Athlete', 'Semi-Final', 'Final']]
print(row)
## Athlete Ben Youssef Meïté
## Semi-Final 9.97
## Final 9.96
## Name: 5, dtype: object
# Format is [number, :][[col_names]] to get one row and certain columns as a series
row = df.iloc[5, :][['Athlete', 'Semi-Final', 'Final']]
print(row)
## Athlete Ben Youssef Meïté
## Semi-Final 9.97
## Final 9.96
## Name: 5, dtype: object
# Format is [number][[col_names]] to get one row and certain columns as a series
row = df.loc[5][['Athlete', 'Semi-Final', 'Final']]
print(row)
## Athlete Ben Youssef Meïté
## Semi-Final 9.97
## Final 9.96
## Name: 5, dtype: object
In practice, only the last of these three methods makes sense to use because the first two tell Python to extract all columns and then to only extract certain columns!
.loc[]
and .iloc[]
can also be used to get multiple columns (and all rows):
# Format is [:, [col_names]] to get all rows and certain columns
cols = df.loc[:, ['Athlete', 'Age']]
print(cols)
## Athlete Age
## 0 Usain Bolt 29
## 1 Justin Gatlin 34
## 2 Andre De Grasse 21
## 3 Yohan Blake 26
## 4 Akani Simbine 22
## 5 Ben Youssef Meïté 29
## 6 Jimmy Vicaut 24
## 7 Trayvon Bromell 21
# Format is [:, [col_idxs]] to get all rows and certain columns
cols = df.iloc[:, [0, 2]]
print(cols)
## Athlete Age
## 0 Usain Bolt 29
## 1 Justin Gatlin 34
## 2 Andre De Grasse 21
## 3 Yohan Blake 26
## 4 Akani Simbine 22
## 5 Ben Youssef Meïté 29
## 6 Jimmy Vicaut 24
## 7 Trayvon Bromell 21
Bonus: you can index a data frame directly (ie without using a function or method) with a list to get columns:
# Format is [[col_names]] to get all rows and certain columns
cols = df[['Athlete', 'Age']]
print(cols)
## Athlete Age
## 0 Usain Bolt 29
## 1 Justin Gatlin 34
## 2 Andre De Grasse 21
## 3 Yohan Blake 26
## 4 Akani Simbine 22
## 5 Ben Youssef Meïté 29
## 6 Jimmy Vicaut 24
## 7 Trayvon Bromell 21
# Format is [list(col_names)] to get all rows and certain columns
cols = df[list(('Athlete', 'Age'))]
print(cols)
## Athlete Age
## 0 Usain Bolt 29
## 1 Justin Gatlin 34
## 2 Andre De Grasse 21
## 3 Yohan Blake 26
## 4 Akani Simbine 22
## 5 Ben Youssef Meïté 29
## 6 Jimmy Vicaut 24
## 7 Trayvon Bromell 21
Extract multiple consecutive columns:
Use slices instead of lists for the column indices. It should be clear that this can only be used together with .iloc[]
.
# Format is [:, slice] to get all rows and consecutive columns
columns = df.iloc[:, :3]
print(columns)
## Athlete Country Age
## 0 Usain Bolt Jamaica 29
## 1 Justin Gatlin United States 34
## 2 Andre De Grasse Canada 21
## 3 Yohan Blake Jamaica 26
## 4 Akani Simbine South Africa 22
## 5 Ben Youssef Meïté Ivory Coast 29
## 6 Jimmy Vicaut France 24
## 7 Trayvon Bromell United States 21
Extracting just one column uses the same process as extracting multiple columns, except this time only one column is specified. Any of the three aforementioned methods can be used: .loc[]
, .iloc[]
and direct indexing.
To extract the column as a data frame (ie as a spreadsheet with only one column), use a list (ie have square brackets around the column name/index):
# Format is [:, ['col_name']] to get all rows and one column as a data frame
age = df.loc[:, ['Age']]
# Format is [:, [col_idx]] to get all rows and one column as a data frame
age = df.iloc[:, [2]]
# Format is [[col_idx]] to get all rows and one column as a data frame
age = df[['Age']]
print(age)
## Age
## 0 29
## 1 34
## 2 21
## 3 26
## 4 22
## 5 29
## 6 24
## 7 21
To extract the column as a series (ie as a ‘column object’), use a value (ie do not have square brackets around the column name/index):
# Format is [:, 'col_name'] to get all rows and one column as a series
age = df.loc[:, 'Age']
# Format is [:, col_idx] to get all rows and one column as a series
age = df.iloc[:, 2]
# Format is [col_idx] to get all rows and one column as a series
age = df['Age']
print(age)
## 0 29
## 1 34
## 2 21
## 3 26
## 4 22
## 5 29
## 6 24
## 7 21
## Name: Age, dtype: int64
Bonus: you can use the direct indexing method on any data frame, which means that the following will work:
# Format is [:][col_name] to get all rows and one column as a series
age = df.loc[:]['Age']
print(age)
## 0 29
## 1 34
## 2 21
## 3 26
## 4 22
## 5 29
## 6 24
## 7 21
## Name: Age, dtype: int64
To extract the column as a list (ie get just the values in the column) use the .tolist()
method:
age = df['Age'].tolist()
print(age)
## [29, 34, 21, 26, 22, 29, 24, 21]
Extract one column and only certain rows:
Any of the above methods that use .loc[]
or .iloc[]
can be simply adapted to extract only certain rows by replacing the colon with either a list or a slice:
# Format is [[row_idx], 'col_name'] to get certain rows and one column as a series
age = df.loc[[2, 3], 'Athlete']
print(age)
## 2 Andre De Grasse
## 3 Yohan Blake
## Name: Athlete, dtype: object
# Format is [slice, col_idx] to get certain, consecutive rows and one column as a series
age = df.iloc[2:4, 0]
print(age)
## 2 Andre De Grasse
## 3 Yohan Blake
## Name: Athlete, dtype: object
Bonus: again, because you can use the direct indexing method on any data frame, the following is also an option:
# Format is [slice][col_name] to get certain, consecutive rows and one column as a series
age = df.loc[2:3]['Athlete']
print(age)
## 2 Andre De Grasse
## 3 Yohan Blake
## Name: Athlete, dtype: object
The most correct way to extract a single value from a data frame or series is to use .at[]
or .iat[]
depending on whether you want to use the name or the index of the column, respectively.
.at[]
uses the name of the column you are interested in:
# Format is [row_idx, 'col_name']
cell = df.at[3, 'Athlete']
print(cell)
## Yohan Blake
.iat[]
uses the index of the column you are interested in:
# Format is [row_idx, col_idx]
cell = df.iat[3, 0]
print(cell)
## Yohan Blake
Note that it is also possible (but less ‘correct’) to use .loc[]
and .iloc[]
to extract a single cell. These methods are actually meant to be used for extracting groups of rows and columns, but if you use them to extract a group of one row and one column it’s exactly the same as extracting a single cell!
First, this is how you use .loc[]
and .iloc[]
to extract single cells as data frames (ie spreadsheets, which in this case only have one cell):
# Format is [[row_idxs], [col_names]] to get a data frame
cell = df.loc[[3], ['Athlete']]
print(cell)
## Athlete
## 3 Yohan Blake
# Format is [[row_idxs], [col_idxs]] to get a data frame
cell = df.iloc[[3], [0]]
print(cell)
## Athlete
## 3 Yohan Blake
Extract single cells as series (ie columns, which in this case only have one cell) by providing either the row or the column as a value, not as a list (ie don’t use square brackets). Whether you provide the row or the column as a value will determine whether your result is named after it’s row or it’s column:
# Format is [[row_idx], col_name] to get a cell named after it's column
cell = df.loc[[3], 'Athlete']
print(cell)
## 3 Yohan Blake
## Name: Athlete, dtype: object
# Format is [row_idx, [col_idxs]] to get a cell named after it's row
cell = df.iloc[3, [0]]
print(cell)
## Athlete Yohan Blake
## Name: 3, dtype: object
To use .loc[]
and .iloc[]
to extract single values as strings or numbers, provide both the row and the column as values as opposed to as lists:
# Format is [row_idx, 'col_name'] if you want a single cell
cell = df.loc[3, 'Athlete']
print(cell)
## Yohan Blake
# Format is [row_idx, col_idx] if you want a single cell
cell = df.iloc[3, 0]
print(cell)
## Yohan Blake
Finally, you can also use a mixture of these methods and direct indexing to extract single values:
# Format is [row_idx]['col_name'] if you want a single cell
cell = df.loc[3]['Athlete']
print(cell)
## Yohan Blake
# Format is [row_idx][col_idx] if you want a single cell
cell = df.iloc[3][0]
print(cell)
## Yohan Blake
Ascending order:
Sort on one column:
df_sorted = df.sort_values('Athlete')
print(df_sorted)
## Athlete Country Age Heat Semi-Final Final
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
Sort first by one column, then by another:
df_sorted = df.sort_values(['Country', 'Age'])
print(df_sorted)
## Athlete Country Age Heat Semi-Final Final
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
Descending order:
df_sorted = df.sort_values('Athlete', ascending=False, na_position='first')
print(df_sorted)
## Athlete Country Age Heat Semi-Final Final
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
Using the columns’ names:
The method below can also be used to remove a column: just leave it out of the indexing.
df_reordered = df[['Heat', 'Semi-Final', 'Final', 'Athlete', 'Country', 'Age']]
print(list(df_reordered))
## ['Heat', 'Semi-Final', 'Final', 'Athlete', 'Country', 'Age']
Using the columns’ indices:
cols = list(df)
cols = cols[:2] + [cols[-1]] + cols[2:-1]
df_reordered = df[cols]
print(list(df_reordered))
## ['Athlete', 'Country', 'Final', 'Age', 'Heat', 'Semi-Final']
For more ways to export data frames, see the page on file IO.
# Export data frame to comma-separated values (.csv)
df.to_csv('data frame.csv', index=False)