⇦ Back

This page is the second part of a series on data frames:

  1. Introduction to Data Frames
  2. Searching, Finding & Filtering Data Frames
  3. Calculations with Data Frames
  4. Multi-Level Data Frames

1 Intro

While ‘indexing’ is something you do when you know exactly which row(s) and column(s) you want, sometimes you do NOT know exact what you want. Instead, you first need to search through the data frame to find the information you are looking for. Then you can subset the data frame to filter out any information you don’t want:

  • ‘Search’ the data frame to see if the data you are looking for exists
    • The result will be a boolean (ie True/False): the data was either found or it wasn’t
  • ‘Find’ the data if it does exist
    • The result will be the location of the data
  • ‘Filter’ the data frame so that it only contains what you want
    • The result will be a new data frame, smaller than the original one, containing only the data you want

1.1 Example Data

We will again use the results from the Rio 2016 Olympics men’s 100m finals as our example data:

import pandas as pd

rio_100m_men = {
    'Athlete': [
        'Usain Bolt', 'Justin Gatlin', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine', 'Ben Youssef Meïté',
        'Jimmy Vicaut', 'Trayvon Bromell'
    ],
    'Country': [
        'Jamaica', 'United States', 'Canada', 'Jamaica', 'South Africa', 'Ivory Coast', 'France', 'United States'
    ],
    'Age': [29, 34, 21, 26, 22, 29, 24, 21],
    'Heat': [10.07, 10.01, 10.04, 10.11, 10.14, 10.03, 10.19, 10.13],
    'Semi-Final': [9.86, 9.94, 9.92, 10.01, 9.98, 9.97, 9.95, 10.01],
    'Final': [9.81, 9.89, 9.91, 9.93, 9.94, 9.96, 10.04, 10.06],
}
df = pd.DataFrame(rio_100m_men)
print(df)
##              Athlete        Country  Age   Heat  Semi-Final  Final
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04
## 7    Trayvon Bromell  United States   21  10.13       10.01  10.06

2 Searching

2.1 Search a Column

Does a column contain a certain value? For example, was Yohan Blake in the final?

# Search for a value
result = 'Yohan Blake' in df['Athlete'].values
print(result)
## True

Alternatively, this type of search can be done using the .any() and .all() methods:

# Search for any occurrence of a value
result = (df['Athlete'] == 'Yohan Blake').any()
print(result)
## True
# Check if all values in a column meet a criteria
result = (df['Age'] <= 34).all()
print(result)
## True

2.2 Search the Values in a Column Individually

You can also get the results of the search for each row: for each athlete, do they come from Jamaica?

# Search each row for a value
result = df['Country'] == 'Jamaica'
print(list(result))
## [True, False, False, True, False, False, False, False]

The same thing can be done with a function: the where() function from Numpy:

import numpy as np

result = np.where(df['Country'] == 'Jamaica', True, False)
print(list(result))
## [True, False, False, True, False, False, False, False]

2.3 Search for Substrings

Instead of just searching for athletes with certain names, we can search within the names for athletes who have the letter “e” in their name, for example:

# Search for a substring
result = ['e' in i for i in df['Athlete']]
print(result)
## [False, False, True, True, True, True, False, True]

Note that the specific task of searching for a letter within a column of strings can also be performed using the .contains() method:

# Search for a substring
result = df['Athlete'].str.contains('e')
print(list(result))
## [False, False, True, True, True, True, False, True]

2.4 Append the Results of a Search as a Column

It’s a small extra step to add the results of a search to the data frame itself:

# Search for a substring
df['e in name?'] = ['e' in i for i in df['Athlete']]
print(df[['Athlete', 'e in name?']])
##              Athlete  e in name?
## 0         Usain Bolt       False
## 1      Justin Gatlin       False
## 2    Andre De Grasse        True
## 3        Yohan Blake        True
## 4      Akani Simbine        True
## 5  Ben Youssef Meïté        True
## 6       Jimmy Vicaut       False
## 7    Trayvon Bromell        True

Do the same thing but using a loop:

# Search for a substring
df['e in name?'] = False
for i, row in df.iterrows():
    if 'e' in df.at[i, 'Athlete']:
        df.at[i, 'e in name?'] = True
    else:
        df.at[i, 'e in name?'] = False
print(df[['Athlete', 'e in name?']])
##              Athlete  e in name?
## 0         Usain Bolt       False
## 1      Justin Gatlin       False
## 2    Andre De Grasse        True
## 3        Yohan Blake        True
## 4      Akani Simbine        True
## 5  Ben Youssef Meïté        True
## 6       Jimmy Vicaut       False
## 7    Trayvon Bromell        True

2.5 Search Through Multiple Columns

The logical operators ‘and’ (&) and ‘or’ (|) can be used to find rows that match very specific criteria. Notice the round brackets that are being used to group the logical evaluations together:

# Search for values across multiple columns
results = (df['Semi-Final'] == 10.01) | ((df['Final'] == 9.89) & (df['Country'] == 'United States'))
print(list(results))
## [False, True, False, True, False, False, False, True]

2.6 Search Using Inequalities

In addition to the logical operators, the following inequality symbols can be mixed and matched:

  • Greater than: >
  • Less than: <
  • Greater than or equal to: >=
  • Less than or equal to: <=
  • not equal to: !=
df['Sub-10'] = df['Final'] < 10.00
print(df[['Athlete', 'Final', 'Sub-10']])
##              Athlete  Final  Sub-10
## 0         Usain Bolt   9.81    True
## 1      Justin Gatlin   9.89    True
## 2    Andre De Grasse   9.91    True
## 3        Yohan Blake   9.93    True
## 4      Akani Simbine   9.94    True
## 5  Ben Youssef Meïté   9.96    True
## 6       Jimmy Vicaut  10.04   False
## 7    Trayvon Bromell  10.06   False

Similar to a previous example, the same thing can be done with the where() function from Numpy:

import numpy as np

df['Sub-10'] = np.where(df['Final'] < 10.00, True, False)
print(df[['Athlete', 'Final', 'Sub-10']])
##              Athlete  Final  Sub-10
## 0         Usain Bolt   9.81    True
## 1      Justin Gatlin   9.89    True
## 2    Andre De Grasse   9.91    True
## 3        Yohan Blake   9.93    True
## 4      Akani Simbine   9.94    True
## 5  Ben Youssef Meïté   9.96    True
## 6       Jimmy Vicaut  10.04   False
## 7    Trayvon Bromell  10.06   False

2.7 Search Using Inequalities Across Multiple Columns

For example, which non-American athletes ran sub-10 in the final?

# Search for values across multiple columns
results = (df['Country'] != 'United States') & (df['Final'] <= 10.00)
print(list(results))
## [True, False, True, True, True, True, False, False]

…or:

# Search for values across multiple columns
results = np.where((df['Country'] != 'United States') & (df['Final'] <= 10.00), True, False)
print(list(results))
## [True, False, True, True, True, True, False, False]

2.8 Search Through Column Names

Instead of searching through the data in a column you can also search through the names of the columns to see if the one you are looking for has been created yet. For example, has a column been created that contains the time the athletes ran in the final?

# Search for a column name
result = 'Final' in list(df)
print(result)
## True

2.9 Search Through a Subset of the Data Frame

Let’s create a subset of the original data frame by asking:

  • Which rows contain athletes from Jamaica or the United States?
  • Which columns contain data from a final (a semi-final or the final)?

then using the .loc[] method to create a subset:

rows = (df['Country'] == 'Jamaica') | (df['Country'] == 'United States')
cols = [col.endswith('Final') for col in list(df)]
subset = df.loc[rows, cols]
print(subset)
##    Semi-Final  Final
## 0        9.86   9.81
## 1        9.94   9.89
## 3       10.01   9.93
## 7       10.01  10.06

Now we can search this subset for rows that contain any sub-10 performance and return the result as a series:

ser = subset[subset < 10].any(axis='columns')
print(ser)
## 0     True
## 1     True
## 3     True
## 7    False
## dtype: bool

3 Finding

‘Finding’ is similar to ‘searching’ except you get the indices of the matches, not just whether they exist or not. For example, what row in the data frame is occupied by Andre De Grasse?

# Find
idx = df[df['Athlete'] == 'Andre De Grasse'].index
print(f'As an index object: {idx}\nAs a list: {list(idx)}')
## As an index object: Index([2], dtype='int64')
## As a list: [2]

Now that we can find indices, let’s use them to look up data in the table:

3.1 VLOOKUP Using Index

Find a value in a column corresponding to a value in a different column. For example, what was Akani Simbine’s time in his heat?

# Look up
idx = df[df['Athlete'] == 'Akani Simbine'].index
value = df.loc[idx, 'Heat']
print(f"Akani Simbine's time in his heat: {list(value)[0]}s")
## Akani Simbine's time in his heat: 10.14s

3.2 VLOOKUP Using Booleans

Instead of using indices, can search a data frame or column using any of the methods described above to generate Booleans. Those can then be used to index the data frame or column to leave only those that were ‘found in the search’. As an example, here’s how to find the non-American athletes who ran sub-10 and get their names:

# Search for values across multiple columns
results = (df['Country'] != 'United States') & (df['Final'] <= 10.00)
# Index a column with the results of the search
non_american_sub_10 = df['Athlete'][results]
print(list(non_american_sub_10))
## ['Usain Bolt', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine', 'Ben Youssef Meïté']

3.3 Find Anywhere

The previous methods can be used to find values in a single column, but what about looking in the entire data frame? The .stack() method can be useful for this:

search_result = df[df.isin(['Akani Simbine'])].stack()
row = search_result.index[0][0]
col = search_result.index[0][1]
print(f'"Akani Simbine" is at index "{row}"" in column "{col}"')
## "Akani Simbine" is at index "4"" in column "Athlete"

This method is discussed more fully on the page about multi-level data frames.

4 Filtering

Filtering removes all rows whose values do not meet certain criteria.

4.1 Filter Using a Single Criterion

As an example, here is the data of the United States athletes only:

# Filter
subset = df[df['Country'] == 'United States']
print(subset)
##            Athlete        Country  Age   Heat  Semi-Final  Final  Sub-10
## 1    Justin Gatlin  United States   34  10.01        9.94   9.89    True
## 7  Trayvon Bromell  United States   21  10.13       10.01  10.06   False

…and of the non-United States athletes only:

# Filter
subset = df[df['Country'] != 'United States']
print(subset)
##              Athlete       Country  Age   Heat  Semi-Final  Final  Sub-10
## 0         Usain Bolt       Jamaica   29  10.07        9.86   9.81    True
## 2    Andre De Grasse        Canada   21  10.04        9.92   9.91    True
## 3        Yohan Blake       Jamaica   26  10.11       10.01   9.93    True
## 4      Akani Simbine  South Africa   22  10.14        9.98   9.94    True
## 5  Ben Youssef Meïté   Ivory Coast   29  10.03        9.97   9.96    True
## 6       Jimmy Vicaut        France   24  10.19        9.95  10.04   False

Let’s take another look at the above code. It starts off as a search: df['Country'] == 'United States' will search for all rows in the “Country” column that are equal to “United States” and return a series of Booleans (True, False, etc, corresponding to the rows in which the “United States” text was found). That series of Booleans is then used to index the data frame, and only the rows corresponding to the “Trues” in the Boolean series ‘survive’ the indexing.

4.2 Filter Using a List of Criteria

Instead of matching only one criteria, here is how to filter out all rows that do not match either of two criteria: what is the data for athletes who came from either the United States or South Africa?

# Filter
subset = df[df['Country'].isin(['United States', 'South Africa'])]
print(subset)
##            Athlete        Country  Age   Heat  Semi-Final  Final  Sub-10
## 1    Justin Gatlin  United States   34  10.01        9.94   9.89    True
## 4    Akani Simbine   South Africa   22  10.14        9.98   9.94    True
## 7  Trayvon Bromell  United States   21  10.13       10.01  10.06   False

…and all athletes who came from neither the US or SA:

# Filter
subset = df[~df['Country'].isin(['United States', 'South Africa'])]
print(subset)
##              Athlete      Country  Age   Heat  Semi-Final  Final  Sub-10
## 0         Usain Bolt      Jamaica   29  10.07        9.86   9.81    True
## 2    Andre De Grasse       Canada   21  10.04        9.92   9.91    True
## 3        Yohan Blake      Jamaica   26  10.11       10.01   9.93    True
## 5  Ben Youssef Meïté  Ivory Coast   29  10.03        9.97   9.96    True
## 6       Jimmy Vicaut       France   24  10.19        9.95  10.04   False

4.3 Filter Using Multiple lists of Criteria/Multiple Columns

Using ‘and not’ logic:

subset = df[
    df['Country'].isin(['United States', 'South Africa', 'Jamaica']) & ~
    df['Age'].isin([26, 34])
]
print(subset)
##            Athlete        Country  Age   Heat  Semi-Final  Final  Sub-10
## 0       Usain Bolt        Jamaica   29  10.07        9.86   9.81    True
## 4    Akani Simbine   South Africa   22  10.14        9.98   9.94    True
## 7  Trayvon Bromell  United States   21  10.13       10.01  10.06   False

Using ‘or’ logic:

subset = df[
    df['Country'].isin(['South Africa']) |
    df['Age'].isin([26, 34])
]
print(subset)
##          Athlete        Country  Age   Heat  Semi-Final  Final  Sub-10
## 1  Justin Gatlin  United States   34  10.01        9.94   9.89    True
## 3    Yohan Blake        Jamaica   26  10.11       10.01   9.93    True
## 4  Akani Simbine   South Africa   22  10.14        9.98   9.94    True

Of course, all the above filtering examples work with both strings and numbers as the criteria.

4.4 Filter Using a Column of Booleans

If one of the columns consists of Booleans you can filter by it. In other words, all the rows with ‘False’ get removed and all the ones with ‘True’ remain:

# Filter by who ran sub-10 seconds
subset = df[df['Sub-10']]
print(subset)
##              Athlete        Country  Age   Heat  Semi-Final  Final  Sub-10
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81    True
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89    True
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91    True
## 3        Yohan Blake        Jamaica   26  10.11       10.01   9.93    True
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94    True
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96    True

4.5 Filter To Get Values Between Two Numbers

Use the .between() method to find the values that are between two numbers, eg to find the athletes who are between 25 and 30 years of age. Use the inclusive keyword argument to specify if you want to include the athletes who are exactly 25 or 30 or not:

  • left will include the athletes who are 25 - but not those who are 30 - and all ages in-between
  • right will include the athletes who are 30 - but not those who are 25 - and all ages in-between
  • both will include the athletes who are 25, 30 and in-between
  • neither will exclude the athletes aged 25 or 30
# Get the athletes aged between 25 and 30 inclusive
subset = df[df['Age'].between(25, 30, inclusive='both')]
print(subset)
##              Athlete      Country  Age   Heat  Semi-Final  Final  Sub-10
## 0         Usain Bolt      Jamaica   29  10.07        9.86   9.81    True
## 3        Yohan Blake      Jamaica   26  10.11       10.01   9.93    True
## 5  Ben Youssef Meïté  Ivory Coast   29  10.03        9.97   9.96    True

As mentioned above, you can use multiple criteria when filtering. So you can get the athletes who are aged between 25 and 30 and who ran between 9.90 and 10.00 seconds in the final:

# Get the athletes aged between 25 and 30 inclusive who ran between 9.90 and 10.00 in the final
subset = df[
    df['Age'].between(25, 30, inclusive='both') &
    df['Final'].between(9.90, 10.00, inclusive='both') 
]
print(subset)
##              Athlete      Country  Age   Heat  Semi-Final  Final  Sub-10
## 3        Yohan Blake      Jamaica   26  10.11       10.01   9.93    True
## 5  Ben Youssef Meïté  Ivory Coast   29  10.03        9.97   9.96    True

4.6 Filter Out Nulls

Filter out all rows that have a null value in a particular column:

subset = df[df['Age'].notnull()]

Filter out all rows that contain any null values in any columns:

subset = df.dropna()

Filter out all rows that have only null values in particular columns:

subset = df.dropna(how='all', subset=['Country', 'Age'])

4.7 Removing Duplicates

Remove all rows that appear more than once:

subset = df.drop_duplicates()

Remove rows where a value appears twice in a particular column, keeping the row where that value first appears:

# Remove all repeat instances of athletes from the same country
subset = df.drop_duplicates('Country', keep='first')

# This removes Yohan Blake and Trayvon Bromell
print(subset)
##              Athlete        Country  Age   Heat  Semi-Final  Final  Sub-10
## 0         Usain Bolt        Jamaica   29  10.07        9.86   9.81    True
## 1      Justin Gatlin  United States   34  10.01        9.94   9.89    True
## 2    Andre De Grasse         Canada   21  10.04        9.92   9.91    True
## 4      Akani Simbine   South Africa   22  10.14        9.98   9.94    True
## 5  Ben Youssef Meïté    Ivory Coast   29  10.03        9.97   9.96    True
## 6       Jimmy Vicaut         France   24  10.19        9.95  10.04   False

4.7.1 BONUS: Removing Duplicates From a List Whilst Preserving the Order

As discussed on the page about lists and sets, a set will remove duplicate values from a list but not preserve their order:

values = ['One', 'Two', 'Three', 'Three']
unique_values = list(set(values))
print(unique_values)
## ['Two', 'One', 'Three']

We would want the result to be ['One', 'Two', 'Three'], not ['Two', 'One', 'Three']. By using a Pandas Series object, however, we can get this:

values = ['One', 'Two', 'Three', 'Three']
unique_values = pd.Series(values).drop_duplicates().tolist()
print(unique_values)
## ['One', 'Two', 'Three']

⇦ Back