⇦ Back

This page is the first part of a series on data frames:

  1. Introduction to Data Frames
  2. Searching, Finding & Filtering Data Frames
  3. Calculations with Data Frames
  4. Multi-Level Data Frames

1 What’s a Data Frame?

A data frame is a table. It is the programming equivalent of an Excel spreadsheet. It has rows and columns with row headings (known as indexes/indices) and column headings.

Data frames are not one of the object types that get included in Python by default. In order to unlock them, you need to install the ‘Pandas’ package and then import it into each script where you want to use them:

  • In the terminal run pip3 install pandas
  • At the top of each script in which you want to use data frames import Pandas with import pandas as pd
  • The objects and functions included in Pandas can then be used by prepending the shorthand pd.

“Pandas” is short for “panel data” which is a term for the type of information you get when you take multiple measurements over a number of time points.

2 Creating a Data Frame

Create from dictionary:
A data frame can be ‘manually’ created by using the DataFrame() function from Pandas. This takes as its input a dictionary object. It turns this into a data frame using the dictionary’s keys as the column headings and the dictionary’s values as the rows. Note that all the columns need to be the same length. In other words, each of the dictionary’s values (which will usually be lists) need to have the same number of elements.

For this page we will use the results of the men’s 100m finals from the Rio 2016 Olympics as our example data:

import pandas as pd

rio_100m_men = {
    'Athlete': [
        'Usain Bolt', 'Justin Gatlin', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine', 'Ben Youssef Meïté',
        'Jimmy Vicaut', 'Trayvon Bromell'
    ],
    'Country': [
        'Jamaica', 'United States', 'Canada', 'Jamaica', 'South Africa', 'Ivory Coast', 'France', 'United States'
    ],
    'Age': [29, 34, 21, 26, 22, 29, 24, 21],
    'Heat': [10.07, 10.01, 10.04, 10.11, 10.14, 10.03, 10.19, 10.13],
    'Semi-Final': [9.86, 9.94, 9.92, 10.01, 9.98, 9.97, 9.95, 10.01],
    'Final': [9.81, 9.89, 9.91, 9.93, 9.94, 9.96, 10.04, 10.06],
}
df = pd.DataFrame(rio_100m_men)
print(df)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06

As you can see, we started with a dictionary rio_100m_men which had the same number of elements (8) in each value. This was then converted into a Pandas data frame and each key became a column heading and each value became a column corresponding to its key. Of course, the fact that Python is a ‘zero-indexed’ language means that Usain Bolt is shown in the above table as having come ‘0th’! That doesn’t make much sense so let’s re-create the data frame using a custom index (custom row headings):

df = pd.DataFrame({
        'Athlete': [
            'Usain Bolt', 'Justin Gatlin', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine',
            'Ben Youssef Meïté', 'Jimmy Vicaut', 'Trayvon Bromell'
        ],
        'Country': [
            'Jamaica', 'United States', 'Canada', 'Jamaica', 'South Africa', 'Ivory Coast', 'France', 'United States'
        ],
        'Age': [29, 34, 21, 26, 22, 29, 24, 21],
        'Heat': [10.07, 10.01, 10.04, 10.11, 10.14, 10.03, 10.19, 10.13],
        'Semi-Final': [9.86, 9.94, 9.92, 10.01, 9.98, 9.97, 9.95, 10.01],
        'Final': [9.81, 9.89, 9.91, 9.93, 9.94, 9.96, 10.04, 10.06],
    }, index=[1, 2, 3, 4, 5, 6, 7, 8]
)
print(df)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 1         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 2      Justin Gatlin  United States   34  10.01        9.94   9.89
## 3    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 4        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 5      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 6  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 7       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 8    Trayvon Bromell  United States   21  10.13       10.01  10.06

Create from array or lists:
Alternatively, a data frame can be created from a Numpy array and/or a list-of-lists instead of from a dictionary:

import numpy as np

# Create an array from a list-of-lists
rio_100m_men = np.array([
    ['Athlete', 'Country', 'Age', 'Heat', 'Semi-Final', 'Final'],
    ['Usain Bolt', 'Jamaica', 29, 10.07, 9.86, 9.81],
    ['Justin Gatlin', 'United States', 34, 10.01, 9.94, 9.89],
    ['Andre De Grasse', 'Canada', 21, 10.04, 9.92, 9.91],
    ['Yohan Blake', 'Jamaica', 26, 10.11, 10.01, 9.93],
    ['Akani Simbine', 'South Africa', 22, 10.14, 9.98, 9.94],
    ['Ben Youssef Meïté', 'Ivory Coast', 29, 10.03, 9.97, 9.96],
    ['Jimmy Vicaut', 'France', 24, 10.19, 9.95, 10.04],
    ['Trayvon Bromell', 'United States', 21, 10.13, 10.01, 10.06],
])
# Convert the array to a data frame
df = pd.DataFrame(
    data=rio_100m_men[1:, :],
    columns=rio_100m_men[0, :]
)
print(df)
##              Athlete        Country Age   Heat Semi-Final  Final
## 0         Usain Bolt        Jamaica  29  10.07       9.86   9.81
## 1      Justin Gatlin  United States  34  10.01       9.94   9.89
## 2    Andre De Grasse         Canada  21  10.04       9.92   9.91
## 3        Yohan Blake        Jamaica  26  10.11      10.01   9.93
## 4      Akani Simbine   South Africa  22  10.14       9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast  29  10.03       9.97   9.96
## 6       Jimmy Vicaut         France  24  10.19       9.95  10.04
## 7    Trayvon Bromell  United States  21  10.13      10.01  10.06

Import from file:
See the page on file IO for more info on importing data frames.

# Import comma-separated values (.csv file) as a data frame
df = pd.read_csv('data frame.csv')

Import from URL:
Any data object on the internet that could be interpreted as a data frame can be imported:

# Import comma-separated values (.csv file) as a data frame
data_url = 'https://raw.githubusercontent.com/rowannicholls/rowannicholls.github.io/master/python/advanced/data%20frame.csv'
df = pd.read_csv(data_url)

3 Displaying a Data Frame

Use the .head() method to see only the first 5 rows of a data frame (or .head(n) to see the first n rows). This is useful if you have a lot of data!

print(df.head())
##            Athlete        Country  Age   Heat  Semi-Final  Final
## 0       Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1    Justin Gatlin  United States   34  10.01        9.94   9.89
## 2  Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3      Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4    Akani Simbine   South Africa   22  10.14        9.98   9.94

Similarly, .tail() will show the last 5 rows (and .tail(n) will show the last n rows):

print(df.tail())
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06

3.1 Display options

Change the amount of information that gets shown in your console when you print a data frame by tweaking the below options:

pd.set_option('display.max_rows', 3)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 40)
pd.set_option('display.width', 200)
pd.set_option('display.precision', 3)
print(df)
##             Athlete        Country  Age   Heat  Semi-Final  Final
## 0        Usain Bolt        Jamaica   29  10.07        9.86   9.81
## ..              ...            ...  ...    ...         ...    ...
## 7   Trayvon Bromell  United States   21  10.13       10.01  10.06
## 
## [8 rows x 6 columns]

Reset these options with the following

pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.max_colwidth')
pd.reset_option('display.precision')
pd.reset_option('display.width')
print(df)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06

For more info about the display options, see here: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html

4 Getting Information about a Data Frame

Get the names of the columns:

# Get the column names as an 'Index object'
colnames = df.columns

# Get the column names as a list
colnames = df.columns.to_list()
# or
colnames = list(df.columns)
# or
colnames = list(df)

print(colnames)
## ['Athlete', 'Country', 'Age', 'Heat', 'Semi-Final', 'Final']

Notice from the above that when you convert a data frame into a list you are only left with the column names.

The ‘row names’ are more properly called the ‘indices’ (or ‘indexes’ if you’re American):

# Get the row names as an 'Index object'
rownames = df.index

# Get the row names as a list
rownames = df.index.to_list()
# or
rownames = list(df.index)

print(rownames)
## [0, 1, 2, 3, 4, 5, 6, 7]

Get the number of columns:

ncols = df.shape[1]
# or
ncols = len(list(df))

print(ncols)
## 6

Get the number of rows:

nrows = df.shape[0]
# or
nrows = len(df)

print(nrows)
## 8

Display the data types of the values in each column:

print(df.dtypes)
## Athlete        object
## Country        object
## Age             int64
## Heat          float64
## Semi-Final    float64
## Final         float64
## dtype: object

Display a summary of the information in a data frame:

print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 8 entries, 0 to 7
## Data columns (total 6 columns):
##  #   Column      Non-Null Count  Dtype  
## ---  ------      --------------  -----  
##  0   Athlete     8 non-null      object 
##  1   Country     8 non-null      object 
##  2   Age         8 non-null      int64  
##  3   Heat        8 non-null      float64
##  4   Semi-Final  8 non-null      float64
##  5   Final       8 non-null      float64
## dtypes: float64(3), int64(1), object(2)
## memory usage: 512.0+ bytes
## None

Display summary statistics for each column (use include='all' to see stats for categorical columns in addition to numeric ones):

print(df.describe(include='all'))
##            Athlete  Country        Age       Heat  Semi-Final      Final
## count            8        8   8.000000   8.000000    8.000000   8.000000
## unique           8        6        NaN        NaN         NaN        NaN
## top     Usain Bolt  Jamaica        NaN        NaN         NaN        NaN
## freq             1        2        NaN        NaN         NaN        NaN
## mean           NaN      NaN  25.750000  10.090000    9.955000   9.942500
## std            NaN      NaN   4.652188   0.062564    0.049857   0.080312
## min            NaN      NaN  21.000000  10.010000    9.860000   9.810000
## 25%            NaN      NaN  21.750000  10.037500    9.935000   9.905000
## 50%            NaN      NaN  25.000000  10.090000    9.960000   9.935000
## 75%            NaN      NaN  29.000000  10.132500    9.987500   9.980000
## max            NaN      NaN  34.000000  10.190000   10.010000  10.060000

Get the count of values in a column:

print(df['Country'].value_counts())
## Country
## Jamaica          2
## United States    2
## Canada           1
## South Africa     1
## Ivory Coast      1
## France           1
## Name: count, dtype: int64

Get a list of the unique values in a column (ie show the countries that were represented in the 100m final, without repeats):

unique = pd.unique(df['Country'])

print(unique)
## ['Jamaica' 'United States' 'Canada' 'South Africa' 'Ivory Coast' 'France']

Check if a data frame is empty or not:

# Create an empty data frame
df_is_empty = pd.DataFrame({'A': []}).empty

print(df_is_empty)
## True

5 Rows in a Data Frame

5.1 Append a row

Append a data frame as a new row by concatenating it to the bottom of the existing data frame via the concat() function. This is the recommended way of doing it as this method preserves the order of the columns:

new_row_dct = {
    'Athlete': 'Chijindu Ujah',
    'Country': 'Great Britain',
    'Age': 22,
    'Heat': 10.13,
    'Semi-Final': 10.01,
    'Final': None,
}
new_row_df = pd.DataFrame(new_row_dct, index=[1])
df_extended = pd.concat([df, new_row_df], ignore_index=True)
print(df_extended)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06
## 8      Chijindu Ujah  Great Britain   22  10.13       10.01   None

To append a series as a new row you need to convert it into a data frame with .to_frame(), then transpose it with .T and also ignore the index:

new_row_dct = {
    'Athlete': 'Chijindu Ujah',
    'Country': 'Great Britain',
    'Age': 22,
    'Heat': 10.13,
    'Semi-Final': 10.01,
    'Final': None,
}
new_row_ser = pd.Series(new_row_dct)
df_extended = pd.concat([df, new_row_ser.to_frame().T], ignore_index=True)
print(df_extended)
##              Athlete        Country Age   Heat Semi-Final  Final
## 0         Usain Bolt        Jamaica  29  10.07       9.86   9.81
## 1      Justin Gatlin  United States  34  10.01       9.94   9.89
## 2    Andre De Grasse         Canada  21  10.04       9.92   9.91
## 3        Yohan Blake        Jamaica  26  10.11      10.01   9.93
## 4      Akani Simbine   South Africa  22  10.14       9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast  29  10.03       9.97   9.96
## 6       Jimmy Vicaut         France  24  10.19       9.95  10.04
## 7    Trayvon Bromell  United States  21  10.13      10.01  10.06
## 8      Chijindu Ujah  Great Britain  22  10.13      10.01   None

If a dictionary’s values are lists like in the example below, you can still convert it into a data frame and append it as normal:

new_row_dct = {
    'Athlete': ['Chijindu Ujah'],
    'Country': ['Great Britain'],
    'Age': [22],
    'Heat': [10.13],
    'Semi-Final': [10.01],
    'Final': [None],
}
new_row_df = pd.DataFrame(new_row_dct)
df_extended = pd.concat([df, new_row_df], ignore_index=True)
print(df_extended)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06
## 8      Chijindu Ujah  Great Britain   22  10.13       10.01   None

Append a list as a new row:

df.loc[df.shape[0]] = ['Chijindu Ujah', 'Great Britain', 22, 10.13, 10.01, None]
print(df)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06
## 8      Chijindu Ujah  Great Britain   22  10.13       10.01    NaN

The code above works best if the data frame has the ‘default’ row indices, ie if the rows are numbered sequentially from 0. If the data frame is instead a subset of a larger data frame (and thus doesn’t have sequential row indices) or if the indices are strings (see the “rename the rows” section), this code might still work but not produce the expected result:

df_indexed = df.set_index('Athlete')
df_indexed.loc[df_indexed.shape[0]] = [
    'Great Britain', 22, 10.13, 10.01, None
]
# This probably isn't what you want
print(df_indexed)
##                          Country  Age   Heat  Semi-Final  Final
## Athlete                                                        
## Usain Bolt               Jamaica   29  10.07        9.86   9.81
## Justin Gatlin      United States   34  10.01        9.94   9.89
## Andre De Grasse           Canada   21  10.04        9.92   9.91
## Yohan Blake              Jamaica   26  10.11       10.01   9.93
## Akani Simbine       South Africa   22  10.14        9.98   9.94
## Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## Jimmy Vicaut              France   24  10.19        9.95  10.04
## Trayvon Bromell    United States   21  10.13       10.01  10.06
## 8                  Great Britain   22  10.13       10.01    NaN

5.2 Insert a row

Whereas ‘appending’ a row will add it onto the bottom of a data frame, the code below ‘inserts’ it into whatever position you specify (in this example it is index 5):

new_row = ['Chijindu Ujah', 'Great Britain', 22, 10.13, 10.01, None]
new_row = np.array([new_row])
new_row = pd.DataFrame(new_row, columns=list(df))
insert_at = 5
objs = [df.iloc[:insert_at, :], new_row, df.iloc[insert_at:, :]]
df_new = pd.concat(objs, ignore_index=True)
print(df_new)
##              Athlete        Country Age   Heat Semi-Final  Final
## 0         Usain Bolt        Jamaica  29  10.07       9.86   9.81
## 1      Justin Gatlin  United States  34  10.01       9.94   9.89
## 2    Andre De Grasse         Canada  21  10.04       9.92   9.91
## 3        Yohan Blake        Jamaica  26  10.11      10.01   9.93
## 4      Akani Simbine   South Africa  22  10.14       9.98   9.94
## 5      Chijindu Ujah  Great Britain  22  10.13      10.01    NaN
## 6  Ben Youssef Meïté    Ivory Coast  29  10.03       9.97   9.96
## 7       Jimmy Vicaut         France  24  10.19       9.95  10.04
## 8    Trayvon Bromell  United States  21  10.13      10.01  10.06

Similarly, the above code works best with the default indices (0, 1, 2, etc).

5.3 Concatenate a data frame as new rows

Instead of adding one row to the bottom of a data frame, now we are adding a whole data frame onto the bottom:

new_row_df = pd.DataFrame({
    'Athlete': ['Jak Ali Harvey', 'Nickel Ashmeade', 'Christophe Lemaitre'],
    'Country': ['Turkey', 'Jamaica', 'France'],
    'Age': [27, 26, 26],
    'Heat': [10.14, 10.13, 10.16],
    'Semi-Final': [10.03, 10.05, 10.07],
    'Final': [None, None, None],
})
new_df = pd.concat([df, new_row_df], ignore_index=True)
print(new_df)
##                 Athlete        Country  Age   Heat  Semi-Final  Final
## 0            Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1         Justin Gatlin  United States   34  10.01        9.94   9.89
## 2       Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3           Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4         Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5     Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6          Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7       Trayvon Bromell  United States   21  10.13       10.01  10.06
## 8        Jak Ali Harvey         Turkey   27  10.14       10.03   None
## 9       Nickel Ashmeade        Jamaica   26  10.13       10.05   None
## 10  Christophe Lemaitre         France   26  10.16       10.07   None

5.4 Merge two data frames

‘Merging’ is a more complicated form of combining data frames than ‘appending’ or ‘concatenating’. The two objects get combined such that where they have matching values the data gets added together in the same row as opposed to in a new row. It’s complicated enough that I’ve made a whole separate page on merging.

5.5 Rename the rows (ie change the row indices)

You can rename rows (as many as you want at one time) using a dictionary together with the rename() function:

# Format is {old_name: new_name}
df = df.rename(index={1: 'Row 1'})
print(df)
##                  Athlete        Country  Age   Heat  Semi-Final  Final
## 0             Usain Bolt        Jamaica   29  10.07        9.86   9.81
## Row 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 2        Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3            Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4          Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5      Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6           Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7        Trayvon Bromell  United States   21  10.13       10.01  10.06
# Format is {old_name: new_name}
df = df.rename(
    index={
        2: 'Row 2',
        3: 'Row 3',
        4: 'Row 4',
    }
)
print(df)
##                  Athlete        Country  Age   Heat  Semi-Final  Final
## 0             Usain Bolt        Jamaica   29  10.07        9.86   9.81
## Row 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## Row 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## Row 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## Row 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5      Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6           Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7        Trayvon Bromell  United States   21  10.13       10.01  10.06

Reset the row names (row indices) but keep the old ones in a column called “index”:

df_reset = df.reset_index()
print(df_reset)
##    index            Athlete        Country  Age   Heat  Semi-Final  Final
## 0      0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1  Row 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 2  Row 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3  Row 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4  Row 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5      5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6      6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7      7    Trayvon Bromell  United States   21  10.13       10.01  10.06

Reset the row names completely:

df = df.reset_index(drop = True)
print(df)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06

Use a column as the row names (row indices):
Up until now we have been using the row indices that were auto-generated when the data frame was created: the rows were numbered from 0 at the top to 7 at the bottom. However, we don’t have to use those numbers and, in fact, we can use words at the row indices if we want. This is how we can use the athletes’ names as the row indices (the .set_index() function is needed):

df_indexed = df.set_index('Athlete')
print(df_indexed)
##                          Country  Age   Heat  Semi-Final  Final
## Athlete                                                        
## Usain Bolt               Jamaica   29  10.07        9.86   9.81
## Justin Gatlin      United States   34  10.01        9.94   9.89
## Andre De Grasse           Canada   21  10.04        9.92   9.91
## Yohan Blake              Jamaica   26  10.11       10.01   9.93
## Akani Simbine       South Africa   22  10.14        9.98   9.94
## Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## Jimmy Vicaut              France   24  10.19        9.95  10.04
## Trayvon Bromell    United States   21  10.13       10.01  10.06

This can make it easier to index the data frame as we can now use row ‘names’ instead of row ‘indices’:

# Get certain rows and certain columns as a data frame
subset = df_indexed.loc[['Justin Gatlin', 'Jimmy Vicaut'], ['Country', 'Age']]
print(subset)
##                      Country  Age
## Athlete                          
## Justin Gatlin  United States   34
## Jimmy Vicaut          France   24
# Get one row and all columns as a series
cols = df_indexed.loc['Jimmy Vicaut', :]
print(cols)
## Country       France
## Age               24
## Heat           10.19
## Semi-Final      9.95
## Final          10.04
## Name: Jimmy Vicaut, dtype: object
# Get one row and one column as a value
cell = df_indexed.loc['Jimmy Vicaut', 'Age']
print(cell)
## 24

5.6 Delete rows

Delete row using row name:

df_deleted = df.drop(1)
print(df_deleted)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06

Delete row using row index (the below example looks exactly the same as the above example because the row names and row indices of our data frame happen to be the same, but this will not necessarily always be the case!):

df_deleted = df.drop(df.index[1])
print(df_deleted)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06

Delete multiple rows by name:

df_deleted = df.drop([1, 2, 5])
print(df_deleted)
##            Athlete        Country  Age   Heat  Semi-Final  Final
## 0       Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 3      Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4    Akani Simbine   South Africa   22  10.14        9.98   9.94
## 6     Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7  Trayvon Bromell  United States   21  10.13       10.01  10.06

Delete rows that have a certain value in a particular column (in this example: delete all rows where the athlete is 30 years old):

df.drop(df[df['Age'] == 30].index)

Delete rows that have one of a list of certain values in a particular column (in this example: delete all rows where the athlete is 30, 31 or 32 years old):

df.drop(df[df['Age'].isin([30, 31, 32])].index)

6 Columns in a Data Frame

6.1 Append a column

To modify an existing data frame by adding columns:

df['Reaction Time'] = [0.155, 0.152, 0.141, 0.145, 0.128, 0.156, 0.140, 0.135]
df['Finish Position'] = ['Gold', 'Silver', 'Bronze', '4th', '5th', '6th', '7th', '8th']
print(df)
##              Athlete        Country  Age   Heat  Semi-Final  Final  Reaction Time Finish Position
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81          0.155            Gold
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89          0.152          Silver
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91          0.141          Bronze
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93          0.145             4th
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94          0.128             5th
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96          0.156             6th
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04          0.140             7th
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06          0.135             8th

To create a new data frame that has a copy of the original data plus the new column(s), use the .assign() method:

df_new = df.assign(Other=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])
print(df_new)
##              Athlete        Country  Age   Heat  Semi-Final  Final  Reaction Time Finish Position Other
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81          0.155            Gold     A
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89          0.152          Silver     B
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91          0.141          Bronze     C
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93          0.145             4th     D
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94          0.128             5th     E
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96          0.156             6th     F
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04          0.140             7th     G
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06          0.135             8th     H

The .assign() method is slower (because it has to create a copy of the data) but it allows you to keep the old, unmodified data frame in addition to having the new, modified one.

Appending a column inside a loop:
The incorrect way:

new_df = df['Athlete']  # This creates a SERIES, not a DATA FRAME
for col in ['Semi-Final', 'Final']:
    new_df[col] = df[col]  # This does not work as expected

The correct way (although it might generate a SettingWithCopy warning):

new_df = df.loc[:, ['Athlete']]  # This creates a DATA FRAME
for col in ['Semi-Final', 'Final']:
    new_df[col] = df[col]  # This works as expected
print(new_df)
##              Athlete  Semi-Final  Final
## 0         Usain Bolt        9.86   9.81
## 1      Justin Gatlin        9.94   9.89
## 2    Andre De Grasse        9.92   9.91
## 3        Yohan Blake       10.01   9.93
## 4      Akani Simbine        9.98   9.94
## 5  Ben Youssef Meïté        9.97   9.96
## 6       Jimmy Vicaut        9.95  10.04
## 7    Trayvon Bromell       10.01  10.06

The correct way (which will not generate a SettingWithCopy warning):

new_df = df.loc[:, ['Athlete']]
for col in ['Semi-Final', 'Final']:
    new_df[col] = df[col].copy()
print(new_df)
##              Athlete  Semi-Final  Final
## 0         Usain Bolt        9.86   9.81
## 1      Justin Gatlin        9.94   9.89
## 2    Andre De Grasse        9.92   9.91
## 3        Yohan Blake       10.01   9.93
## 4      Akani Simbine        9.98   9.94
## 5  Ben Youssef Meïté        9.97   9.96
## 6       Jimmy Vicaut        9.95  10.04
## 7    Trayvon Bromell       10.01  10.06

6.2 Insert a column

‘Appending’ a column will place it on the end, whereas ‘inserting’ a column will place it at the location specified (in this case it will be at position 1).

new_col = [
    'Jamaica', 'United States', 'Canada', 'Jamaica', 'South Africa', 'Ivory Coast', 'France', 'United States'
]
new_df.insert(1, 'Country', new_col)
print(new_df)
##              Athlete        Country  Semi-Final  Final
## 0         Usain Bolt        Jamaica        9.86   9.81
## 1      Justin Gatlin  United States        9.94   9.89
## 2    Andre De Grasse         Canada        9.92   9.91
## 3        Yohan Blake        Jamaica       10.01   9.93
## 4      Akani Simbine   South Africa        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast        9.97   9.96
## 6       Jimmy Vicaut         France        9.95  10.04
## 7    Trayvon Bromell  United States       10.01  10.06

6.3 Concatenate a data frame as new columns

Instead of adding one column to the side of a data frame, now we are adding a whole data frame onto the side:

new_col_df = pd.DataFrame({
    'Reaction Time': [0.155, 0.152, 0.141, 0.145, 0.128, 0.156, 0.140, 0.135],
    'Finish Position': ['Gold', 'Silver', 'Bronze', '4th', '5th', '6th', '7th', '8th'],
})
new_df = pd.concat([new_df, new_col_df], axis=1)
print(new_df)
##              Athlete        Country  Semi-Final  Final  Reaction Time Finish Position
## 0         Usain Bolt        Jamaica        9.86   9.81          0.155            Gold
## 1      Justin Gatlin  United States        9.94   9.89          0.152          Silver
## 2    Andre De Grasse         Canada        9.92   9.91          0.141          Bronze
## 3        Yohan Blake        Jamaica       10.01   9.93          0.145             4th
## 4      Akani Simbine   South Africa        9.98   9.94          0.128             5th
## 5  Ben Youssef Meïté    Ivory Coast        9.97   9.96          0.156             6th
## 6       Jimmy Vicaut         France        9.95  10.04          0.140             7th
## 7    Trayvon Bromell  United States       10.01  10.06          0.135             8th

6.4 Merge two data frames

‘Merging’ is a more complicated form of combining data frames than ‘appending’ or ‘concatenating’. The two objects get combined such that where they have matching values the data gets added together in the same row as opposed to in a new row. It’s complicated enough that I’ve made a whole page on merging.

6.5 Join Columns

Use the agg() method to aggregate two (or more) columns of text into one with a custom substring as a conjunction:

# Join multiple columns of text
ser = df[['Finish Position', 'Athlete']].agg(': '.join, axis=1)

print(ser)
## 0           Gold: Usain Bolt
## 1      Silver: Justin Gatlin
## 2    Bronze: Andre De Grasse
## 3           4th: Yohan Blake
## 4         5th: Akani Simbine
## 5     6th: Ben Youssef Meïté
## 6          7th: Jimmy Vicaut
## 7       8th: Trayvon Bromell
## dtype: object

6.6 Rename the columns

Rename one column as follows:

# Format is {'old_name': 'new_name'}
df_renamed = df.rename(columns={'Athlete': 'Name'})
print(list(df_renamed))
## ['Name', 'Country', 'Age', 'Heat', 'Semi-Final', 'Final', 'Reaction Time', 'Finish Position']

Rename multiple columns as follows:

# Format is {'old_name': 'new_name'}
df_renamed = df.rename(
    columns={
        'Heat': 'Race 1',
        'Semi-Final': 'Race 2',
        'Final': 'Race 3',
    }
)
print(list(df_renamed))
## ['Athlete', 'Country', 'Age', 'Race 1', 'Race 2', 'Race 3', 'Reaction Time', 'Finish Position']

Alternatively, you can use a dictionary comprehension to rename all columns at once:

new_names = ['Name', 'Nation', 'Years', 'Race 1', 'Race 2', 'Race 3', 'Position', 'Reaction']
dct = {old: new for old, new in zip(list(df), new_names)}
df_renamed = df.rename(columns=dct)
print(list(df_renamed))
## ['Name', 'Nation', 'Years', 'Race 1', 'Race 2', 'Race 3', 'Position', 'Reaction']

6.7 Delete columns

Delete column by column name:

df = df.drop('Reaction Time', axis=1)
print(df)
##              Athlete        Country  Age   Heat  Semi-Final  Final Finish Position
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81            Gold
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89          Silver
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91          Bronze
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93             4th
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94             5th
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96             6th
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04             7th
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06             8th

Delete column(s) by index:

df = df.drop(df.columns[[6]], axis=1)
print(df)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06

7 Iterating over a Data Frame

7.1 Row-by-row

Iterating over a data frame’s rows using .iterrows()

for i, row in df.iterrows():
    row_num = i
    person = row['Athlete']
    age = row['Age']
    print(i, person, age)
## 0 Usain Bolt 29
## 1 Justin Gatlin 34
## 2 Andre De Grasse 21
## 3 Yohan Blake 26
## 4 Akani Simbine 22
## 5 Ben Youssef Meïté 29
## 6 Jimmy Vicaut 24
## 7 Trayvon Bromell 21

Iterating over a data frame’s rows using .itertuples()

for row in df.itertuples():
    row_num = row.Index
    person = row.Athlete
    age = row.Age
    print(row_num, person, age)
## 0 Usain Bolt 29
## 1 Justin Gatlin 34
## 2 Andre De Grasse 21
## 3 Yohan Blake 26
## 4 Akani Simbine 22
## 5 Ben Youssef Meïté 29
## 6 Jimmy Vicaut 24
## 7 Trayvon Bromell 21

7.2 Column-by-column

Iterate over a data frame’s columns using .items()

for colname, col in df.items():
    Gatlin = col[1]
    Simbine = col[4]
    print(colname, Gatlin, Simbine)
## Athlete Justin Gatlin Akani Simbine
## Country United States South Africa
## Age 34 22
## Heat 10.01 10.14
## Semi-Final 9.94 9.98
## Final 9.89 9.94

Iterate over a data frame’s columns using the column names

for colname in list(df):
    Bolt = df[colname][1]
    print(Bolt)
## Justin Gatlin
## United States
## 34
## 10.01
## 9.94
## 9.89

7.3 Cell-by-cell

Iterate over every cell in a data frame by using BOTH .iterrows() and .items()

# Use a smaller data frame for this example
dct = {
    'Athlete': ['Usain Bolt', 'Justin Gatlin', 'Andre De Grasse'],
    'Country': ['Jamaica', 'United States', 'Canada'],
    'Age': [29, 34, 21],
    'Heat': [10.07, 10.01, 10.04],
    'Semi-Final': [9.86, 9.94, 9.92],
    'Final': [9.81, 9.89, 9.91],
}
df_small = pd.DataFrame(dct)
# Iterate over every cell
for i, row in df_small.iterrows():
    for colname, cell in row.items():
        print(f'{colname:10} {cell}')
## Athlete    Usain Bolt
## Country    Jamaica
## Age        29
## Heat       10.07
## Semi-Final 9.86
## Final      9.81
## Athlete    Justin Gatlin
## Country    United States
## Age        34
## Heat       10.01
## Semi-Final 9.94
## Final      9.89
## Athlete    Andre De Grasse
## Country    Canada
## Age        21
## Heat       10.04
## Semi-Final 9.92
## Final      9.91

Iterate over every cell in a single row by reducing the data frame to a single row then iterating through the columns:

# Iterate over every cell in row 3 (index number 2)
idx = 2
for colname, col in df[idx:idx + 1].items():
    cell = col.values[0]
    print(cell)
## Andre De Grasse
## Canada
## 21
## 10.04
## 9.92
## 9.91

8 Indexing a Data Frame

‘Indexing’ is the method used to select or extract data in a data frame.

8.1 Extract multiple rows and columns

There are two methods that can be used to get certain rows and columns out of a data frame:

  • .loc[] uses the names of the columns you want
  • .iloc[] uses the indices of the columns you want

Remember that row names are the same as row indices, so both methods use these.

Columns names should be provided as lists (see the ‘extract one cell’ section to see what happens if you don’t) while column indices and row indices can be provided as either lists or slices. The advantage of using slices is that you can quickly provide many numbers; the disadvantage is that these numbers have to be consecutive.

Here’s a demonstration of .loc[] and .iloc[] at work using lists:

# Format is [[row_idxs], [col_names]]
subset = df.loc[[2, 4], ['Athlete', 'Final']]
print(subset)
##            Athlete  Final
## 2  Andre De Grasse   9.91
## 4    Akani Simbine   9.94
# Format is [[row_idxs], [col_idxs]]
subset = df.iloc[[2, 4], [0, 5]]
print(subset)
##            Athlete  Final
## 2  Andre De Grasse   9.91
## 4    Akani Simbine   9.94

Extract multiple consecutive rows and columns:
As mentioned, using slices instead of lists when specifying indices can be more efficient but only works when extracting consecutive rows and columns:

# Format is [slice, [col_names]] to get consecutive rows
subset = df.loc[2:4, ['Athlete', 'Country', 'Age']]
print(subset)
##            Athlete       Country  Age
## 2  Andre De Grasse        Canada   21
## 3      Yohan Blake       Jamaica   26
## 4    Akani Simbine  South Africa   22
# Format is [slice, slice] to get consecutive rows and cols
subset = df.iloc[2:5, :3]
print(subset)
##            Athlete       Country  Age
## 2  Andre De Grasse        Canada   21
## 3      Yohan Blake       Jamaica   26
## 4    Akani Simbine  South Africa   22

Also note that in the above examples .loc[] took 2:4 as its slice of row indices while .iloc[] took 2:5 and yet they returned the same rows! The methods handle slices differently!

8.2 Extract multiple rows

Similar to the above, .loc[] and .iloc[] can be used to get multiple rows (and all columns). This is a rare occasion where both methods are used in the exact same way!

# Format is [[row_idxs], :] to get multiple rows and all columns
rows = df.loc[[1, 3], :]
print(rows)
##          Athlete        Country  Age   Heat  Semi-Final  Final
## 1  Justin Gatlin  United States   34  10.01        9.94   9.89
## 3    Yohan Blake        Jamaica   26  10.11       10.01   9.93
# Format is [[row_idxs], :] to get multiple rows and all columns
rows = df.iloc[[1, 3], :]
print(rows)
##          Athlete        Country  Age   Heat  Semi-Final  Final
## 1  Justin Gatlin  United States   34  10.01        9.94   9.89
## 3    Yohan Blake        Jamaica   26  10.11       10.01   9.93

Extract multiple consecutive rows:
Use slices instead of lists for the row indices/names:

# Format is [slice, :] to get consecutive rows and all columns
rows = df.loc[2:3, :]
print(rows)
##            Athlete  Country  Age   Heat  Semi-Final  Final
## 2  Andre De Grasse   Canada   21  10.04        9.92   9.91
## 3      Yohan Blake  Jamaica   26  10.11       10.01   9.93
# Format is [slice, :] to get consecutive rows and all columns
rows = df.iloc[2:4, :]
print(rows)
##            Athlete  Country  Age   Heat  Semi-Final  Final
## 2  Andre De Grasse   Canada   21  10.04        9.92   9.91
## 3      Yohan Blake  Jamaica   26  10.11       10.01   9.93

Once again note that .loc[] and .iloc[] handle slices differently (2:3 and 2:4 produced the same result!).

Bonus: you can index a data frame directly (ie without using a function or method) with a slice to get consecutive rows:

# Format is [slice] to get consecutive rows
rows = df[slice(2, 4)]
print(rows)
##            Athlete  Country  Age   Heat  Semi-Final  Final
## 2  Andre De Grasse   Canada   21  10.04        9.92   9.91
## 3      Yohan Blake  Jamaica   26  10.11       10.01   9.93
rows = df[2:4]
print(rows)
##            Athlete  Country  Age   Heat  Semi-Final  Final
## 2  Andre De Grasse   Canada   21  10.04        9.92   9.91
## 3      Yohan Blake  Jamaica   26  10.11       10.01   9.93

8.3 Extract one row

Extracting just one row uses the same process as extracting multiple rows, except this time only one row is specified. Any of the three aforementioned methods can be used (.loc[], .iloc[] and direct indexing).

To extract the row as a data frame (ie as a spreadsheet with only one row), use a list (ie have square brackets around the row index) or slice:

# Format is [[number], :] to get one row and all columns as a data frame
row = df.loc[[5], :]
print(row)
##              Athlete      Country  Age   Heat  Semi-Final  Final
## 5  Ben Youssef Meïté  Ivory Coast   29  10.03        9.97   9.96
# Format is [[number], :] to get one row and all columns as a data frame
row = df.iloc[[5], :]
print(row)
##              Athlete      Country  Age   Heat  Semi-Final  Final
## 5  Ben Youssef Meïté  Ivory Coast   29  10.03        9.97   9.96
# Format is [slice(x, x + 1)] to get one row and all columns as a data frame
row = df[slice(5, 6)]
print(row)
##              Athlete      Country  Age   Heat  Semi-Final  Final
## 5  Ben Youssef Meïté  Ivory Coast   29  10.03        9.97   9.96
# Format is [x:x + 1] to get one row and all columns as a data frame
row = df[5:6]
print(row)
##              Athlete      Country  Age   Heat  Semi-Final  Final
## 5  Ben Youssef Meïté  Ivory Coast   29  10.03        9.97   9.96

To extract the row as a series (ie as a column with only one row), use a value (ie do not have square brackets around the row index):

# Format is [number, :] to get one row and all columns as a series
row = df.loc[5, :]
print(row)
## Athlete       Ben Youssef Meïté
## Country             Ivory Coast
## Age                          29
## Heat                      10.03
## Semi-Final                 9.97
## Final                      9.96
## Name: 5, dtype: object
# Format is [number, :] to get one row and all columns as a series
row = df.iloc[5, :]
print(row)
## Athlete       Ben Youssef Meïté
## Country             Ivory Coast
## Age                          29
## Heat                      10.03
## Semi-Final                 9.97
## Final                      9.96
## Name: 5, dtype: object
# Format is [number] to get one row and all columns as a series
row = df.loc[5]
print(row)
## Athlete       Ben Youssef Meïté
## Country             Ivory Coast
## Age                          29
## Heat                      10.03
## Semi-Final                 9.97
## Final                      9.96
## Name: 5, dtype: object

Extract one row and only certain columns:
If you use any of the above three methods to extract a single row as a series you can then use direct indexing to extract only certain columns:

# Format is [number, :][[col_names]] to get one row and certain columns as a series
row = df.loc[5, :][['Athlete', 'Semi-Final', 'Final']]
print(row)
## Athlete       Ben Youssef Meïté
## Semi-Final                 9.97
## Final                      9.96
## Name: 5, dtype: object
# Format is [number, :][[col_names]] to get one row and certain columns as a series
row = df.iloc[5, :][['Athlete', 'Semi-Final', 'Final']]
print(row)
## Athlete       Ben Youssef Meïté
## Semi-Final                 9.97
## Final                      9.96
## Name: 5, dtype: object
# Format is [number][[col_names]] to get one row and certain columns as a series
row = df.loc[5][['Athlete', 'Semi-Final', 'Final']]
print(row)
## Athlete       Ben Youssef Meïté
## Semi-Final                 9.97
## Final                      9.96
## Name: 5, dtype: object

In practice, only the last of these three methods makes sense to use because the first two tell Python to extract all columns and then to only extract certain columns!

8.4 Extract multiple columns

.loc[] and .iloc[] can also be used to get multiple columns (and all rows):

# Format is [:, [col_names]] to get all rows and certain columns
cols = df.loc[:, ['Athlete', 'Age']]
print(cols)
##              Athlete  Age
## 0         Usain Bolt   29
## 1      Justin Gatlin   34
## 2    Andre De Grasse   21
## 3        Yohan Blake   26
## 4      Akani Simbine   22
## 5  Ben Youssef Meïté   29
## 6       Jimmy Vicaut   24
## 7    Trayvon Bromell   21
# Format is [:, [col_idxs]] to get all rows and certain columns
cols = df.iloc[:, [0, 2]]
print(cols)
##              Athlete  Age
## 0         Usain Bolt   29
## 1      Justin Gatlin   34
## 2    Andre De Grasse   21
## 3        Yohan Blake   26
## 4      Akani Simbine   22
## 5  Ben Youssef Meïté   29
## 6       Jimmy Vicaut   24
## 7    Trayvon Bromell   21

Bonus: you can index a data frame directly (ie without using a function or method) with a list to get columns:

# Format is [[col_names]] to get all rows and certain columns
cols = df[['Athlete', 'Age']]
print(cols)
##              Athlete  Age
## 0         Usain Bolt   29
## 1      Justin Gatlin   34
## 2    Andre De Grasse   21
## 3        Yohan Blake   26
## 4      Akani Simbine   22
## 5  Ben Youssef Meïté   29
## 6       Jimmy Vicaut   24
## 7    Trayvon Bromell   21
# Format is [list(col_names)] to get all rows and certain columns
cols = df[list(('Athlete', 'Age'))]
print(cols)
##              Athlete  Age
## 0         Usain Bolt   29
## 1      Justin Gatlin   34
## 2    Andre De Grasse   21
## 3        Yohan Blake   26
## 4      Akani Simbine   22
## 5  Ben Youssef Meïté   29
## 6       Jimmy Vicaut   24
## 7    Trayvon Bromell   21

Extract multiple consecutive columns:
Use slices instead of lists for the column indices. It should be clear that this can only be used together with .iloc[].

# Format is [:, slice] to get all rows and consecutive columns
columns = df.iloc[:, :3]
print(columns)
##              Athlete        Country  Age
## 0         Usain Bolt        Jamaica   29
## 1      Justin Gatlin  United States   34
## 2    Andre De Grasse         Canada   21
## 3        Yohan Blake        Jamaica   26
## 4      Akani Simbine   South Africa   22
## 5  Ben Youssef Meïté    Ivory Coast   29
## 6       Jimmy Vicaut         France   24
## 7    Trayvon Bromell  United States   21

8.5 Extract one column

Extracting just one column uses the same process as extracting multiple columns, except this time only one column is specified. Any of the three aforementioned methods can be used: .loc[], .iloc[] and direct indexing.

To extract the column as a data frame (ie as a spreadsheet with only one column), use a list (ie have square brackets around the column name/index):

# Format is [:, ['col_name']] to get all rows and one column as a data frame
age = df.loc[:, ['Age']]
# Format is [:, [col_idx]] to get all rows and one column as a data frame
age = df.iloc[:, [2]]
# Format is [[col_idx]] to get all rows and one column as a data frame
age = df[['Age']]
print(age)
##    Age
## 0   29
## 1   34
## 2   21
## 3   26
## 4   22
## 5   29
## 6   24
## 7   21

To extract the column as a series (ie as a ‘column object’), use a value (ie do not have square brackets around the column name/index):

# Format is [:, 'col_name'] to get all rows and one column as a series
age = df.loc[:, 'Age']
# Format is [:, col_idx] to get all rows and one column as a series
age = df.iloc[:, 2]
# Format is [col_idx] to get all rows and one column as a series
age = df['Age']
print(age)
## 0    29
## 1    34
## 2    21
## 3    26
## 4    22
## 5    29
## 6    24
## 7    21
## Name: Age, dtype: int64

Bonus: you can use the direct indexing method on any data frame, which means that the following will work:

# Format is [:][col_name] to get all rows and one column as a series
age = df.loc[:]['Age']
print(age)
## 0    29
## 1    34
## 2    21
## 3    26
## 4    22
## 5    29
## 6    24
## 7    21
## Name: Age, dtype: int64

To extract the column as a list (ie get just the values in the column) use the .tolist() method:

age = df['Age'].tolist()
print(age)
## [29, 34, 21, 26, 22, 29, 24, 21]

Extract one column and only certain rows:
Any of the above methods that use .loc[] or .iloc[] can be simply adapted to extract only certain rows by replacing the colon with either a list or a slice:

# Format is [[row_idx], 'col_name'] to get certain rows and one column as a series
age = df.loc[[2, 3], 'Athlete']
print(age)
## 2    Andre De Grasse
## 3        Yohan Blake
## Name: Athlete, dtype: object
# Format is [slice, col_idx] to get certain, consecutive rows and one column as a series
age = df.iloc[2:4, 0]
print(age)
## 2    Andre De Grasse
## 3        Yohan Blake
## Name: Athlete, dtype: object

Bonus: again, because you can use the direct indexing method on any data frame, the following is also an option:

# Format is [slice][col_name] to get certain, consecutive rows and one column as a series
age = df.loc[2:3]['Athlete']
print(age)
## 2    Andre De Grasse
## 3        Yohan Blake
## Name: Athlete, dtype: object

8.6 Extract one cell

The most correct way to extract a single value from a data frame or series is to use .at[] or .iat[] depending on whether you want to use the name or the index of the column, respectively.

.at[] uses the name of the column you are interested in:

# Format is [row_idx, 'col_name']
cell = df.at[3, 'Athlete']
print(cell)
## Yohan Blake

.iat[] uses the index of the column you are interested in:

# Format is [row_idx, col_idx]
cell = df.iat[3, 0]
print(cell)
## Yohan Blake

Note that it is also possible (but less ‘correct’) to use .loc[] and .iloc[] to extract a single cell. These methods are actually meant to be used for extracting groups of rows and columns, but if you use them to extract a group of one row and one column it’s exactly the same as extracting a single cell!

First, this is how you use .loc[] and .iloc[] to extract single cells as data frames (ie spreadsheets, which in this case only have one cell):

# Format is [[row_idxs], [col_names]] to get a data frame
cell = df.loc[[3], ['Athlete']]
print(cell)
##        Athlete
## 3  Yohan Blake
# Format is [[row_idxs], [col_idxs]] to get a data frame
cell = df.iloc[[3], [0]]
print(cell)
##        Athlete
## 3  Yohan Blake

Extract single cells as series (ie columns, which in this case only have one cell) by providing either the row or the column as a value, not as a list (ie don’t use square brackets). Whether you provide the row or the column as a value will determine whether your result is named after it’s row or it’s column:

# Format is [[row_idx], col_name] to get a cell named after it's column
cell = df.loc[[3], 'Athlete']
print(cell)
## 3    Yohan Blake
## Name: Athlete, dtype: object
# Format is [row_idx, [col_idxs]] to get a cell named after it's row
cell = df.iloc[3, [0]]
print(cell)
## Athlete    Yohan Blake
## Name: 3, dtype: object

To use .loc[] and .iloc[] to extract single values as strings or numbers, provide both the row and the column as values as opposed to as lists:

# Format is [row_idx, 'col_name'] if you want a single cell
cell = df.loc[3, 'Athlete']
print(cell)
## Yohan Blake
# Format is [row_idx, col_idx] if you want a single cell
cell = df.iloc[3, 0]
print(cell)
## Yohan Blake

Finally, you can also use a mixture of these methods and direct indexing to extract single values:

# Format is [row_idx]['col_name'] if you want a single cell
cell = df.loc[3]['Athlete']
print(cell)
## Yohan Blake
# Format is [row_idx][col_idx] if you want a single cell
cell = df.iloc[3][0]
print(cell)
## Yohan Blake

9 Sorting

9.1 Sort rows

Ascending order:
Sort on one column:

df_sorted = df.sort_values('Athlete')
print(df_sorted)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93

Sort first by one column, then by another:

df_sorted = df.sort_values(['Country', 'Age'])
print(df_sorted)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89

Descending order:

df_sorted = df.sort_values('Athlete', ascending=False, na_position='first')
print(df_sorted)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94

9.2 Re-order columns

Using the columns’ names:
The method below can also be used to remove a column: just leave it out of the indexing.

df_reordered = df[['Heat', 'Semi-Final', 'Final', 'Athlete', 'Country', 'Age']]
print(list(df_reordered))
## ['Heat', 'Semi-Final', 'Final', 'Athlete', 'Country', 'Age']

Using the columns’ indices:

cols = list(df)
cols = cols[:2] + [cols[-1]] + cols[2:-1]
df_reordered = df[cols]
print(list(df_reordered))
## ['Athlete', 'Country', 'Final', 'Age', 'Heat', 'Semi-Final']

10 Export

For more ways to export data frames, see the page on file IO.

# Export data frame to comma-separated values (.csv)
df.to_csv('data frame.csv', index=False)

⇦ Back