This page is the second part of a series on data frames:
- Introduction to Data Frames
- Searching, Finding & Filtering Data Frames
- Calculations with Data Frames
- Multi-Level Data Frames
While ‘indexing’ is something you do when you know exactly which row(s) and column(s) you want, sometimes you do NOT know exact what you want. Instead, you first need to search through the data frame to find the information you are looking for. Then you can subset the data frame to filter out any information you don’t want:
We will again use the results from the Rio 2016 Olympics men’s 100m finals as our example data:
import pandas as pd
rio_100m_men = {
'Athlete': [
'Usain Bolt', 'Justin Gatlin', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine', 'Ben Youssef Meïté',
'Jimmy Vicaut', 'Trayvon Bromell'
],
'Country': [
'Jamaica', 'United States', 'Canada', 'Jamaica', 'South Africa', 'Ivory Coast', 'France', 'United States'
],
'Age': [29, 34, 21, 26, 22, 29, 24, 21],
'Heat': [10.07, 10.01, 10.04, 10.11, 10.14, 10.03, 10.19, 10.13],
'Semi-Final': [9.86, 9.94, 9.92, 10.01, 9.98, 9.97, 9.95, 10.01],
'Final': [9.81, 9.89, 9.91, 9.93, 9.94, 9.96, 10.04, 10.06],
}
df = pd.DataFrame(rio_100m_men)
print(df)
## Athlete Country Age Heat Semi-Final Final
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81
## 1 Justin Gatlin United States 34 10.01 9.94 9.89
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06
Does a column contain a certain value? For example, was Yohan Blake in the final?
# Search for a value
result = 'Yohan Blake' in df['Athlete'].values
print(result)
## True
Alternatively, this type of search can be done using the
.any()
and .all()
methods:
# Search for any occurrence of a value
result = (df['Athlete'] == 'Yohan Blake').any()
print(result)
## True
# Check if all values in a column meet a criteria
result = (df['Age'] <= 34).all()
print(result)
## True
You can also get the results of the search for each row: for each athlete, do they come from Jamaica?
# Search each row for a value
result = df['Country'] == 'Jamaica'
print(list(result))
## [True, False, False, True, False, False, False, False]
The same thing can be done with a function: the where()
function from Numpy:
import numpy as np
result = np.where(df['Country'] == 'Jamaica', True, False)
print(list(result))
## [True, False, False, True, False, False, False, False]
Instead of just searching for athletes with certain names, we can search within the names for athletes who have the letter “e” in their name, for example:
# Search for a substring
result = ['e' in i for i in df['Athlete']]
print(result)
## [False, False, True, True, True, True, False, True]
Note that the specific task of searching for a letter within a column
of strings can also be performed using the .contains()
method:
# Search for a substring
result = df['Athlete'].str.contains('e')
print(list(result))
## [False, False, True, True, True, True, False, True]
It’s a small extra step to add the results of a search to the data frame itself:
# Search for a substring
df['e in name?'] = ['e' in i for i in df['Athlete']]
print(df[['Athlete', 'e in name?']])
## Athlete e in name?
## 0 Usain Bolt False
## 1 Justin Gatlin False
## 2 Andre De Grasse True
## 3 Yohan Blake True
## 4 Akani Simbine True
## 5 Ben Youssef Meïté True
## 6 Jimmy Vicaut False
## 7 Trayvon Bromell True
Do the same thing but using a loop:
# Search for a substring
df['e in name?'] = False
for i, row in df.iterrows():
if 'e' in df.at[i, 'Athlete']:
df.at[i, 'e in name?'] = True
else:
df.at[i, 'e in name?'] = False
print(df[['Athlete', 'e in name?']])
## Athlete e in name?
## 0 Usain Bolt False
## 1 Justin Gatlin False
## 2 Andre De Grasse True
## 3 Yohan Blake True
## 4 Akani Simbine True
## 5 Ben Youssef Meïté True
## 6 Jimmy Vicaut False
## 7 Trayvon Bromell True
The logical operators ‘and’ (&
) and ‘or’
(|
) can be used to find rows that match very specific
criteria. Notice the round brackets that are being used to group the
logical evaluations together:
# Search for values across multiple columns
results = (df['Semi-Final'] == 10.01) | ((df['Final'] == 9.89) & (df['Country'] == 'United States'))
print(list(results))
## [False, True, False, True, False, False, False, True]
In addition to the logical operators, the following inequality symbols can be mixed and matched:
>
<
>=
<=
!=
df['Sub-10'] = df['Final'] < 10.00
print(df[['Athlete', 'Final', 'Sub-10']])
## Athlete Final Sub-10
## 0 Usain Bolt 9.81 True
## 1 Justin Gatlin 9.89 True
## 2 Andre De Grasse 9.91 True
## 3 Yohan Blake 9.93 True
## 4 Akani Simbine 9.94 True
## 5 Ben Youssef Meïté 9.96 True
## 6 Jimmy Vicaut 10.04 False
## 7 Trayvon Bromell 10.06 False
Similar to a previous example, the same thing can be done with the
where()
function from Numpy:
import numpy as np
df['Sub-10'] = np.where(df['Final'] < 10.00, True, False)
print(df[['Athlete', 'Final', 'Sub-10']])
## Athlete Final Sub-10
## 0 Usain Bolt 9.81 True
## 1 Justin Gatlin 9.89 True
## 2 Andre De Grasse 9.91 True
## 3 Yohan Blake 9.93 True
## 4 Akani Simbine 9.94 True
## 5 Ben Youssef Meïté 9.96 True
## 6 Jimmy Vicaut 10.04 False
## 7 Trayvon Bromell 10.06 False
For example, which non-American athletes ran sub-10 in the final?
# Search for values across multiple columns
results = (df['Country'] != 'United States') & (df['Final'] <= 10.00)
print(list(results))
## [True, False, True, True, True, True, False, False]
…or:
# Search for values across multiple columns
results = np.where((df['Country'] != 'United States') & (df['Final'] <= 10.00), True, False)
print(list(results))
## [True, False, True, True, True, True, False, False]
Instead of searching through the data in a column you can also search through the names of the columns to see if the one you are looking for has been created yet. For example, has a column been created that contains the time the athletes ran in the final?
# Search for a column name
result = 'Final' in list(df)
print(result)
## True
Let’s create a subset of the original data frame by asking:
then using the .loc[]
method to create a subset:
rows = (df['Country'] == 'Jamaica') | (df['Country'] == 'United States')
cols = [col.endswith('Final') for col in list(df)]
subset = df.loc[rows, cols]
print(subset)
## Semi-Final Final
## 0 9.86 9.81
## 1 9.94 9.89
## 3 10.01 9.93
## 7 10.01 10.06
Now we can search this subset for rows that contain any sub-10 performance and return the result as a series:
ser = subset[subset < 10].any(axis='columns')
print(ser)
## 0 True
## 1 True
## 3 True
## 7 False
## dtype: bool
‘Finding’ is similar to ‘searching’ except you get the indices of the matches, not just whether they exist or not. For example, what row in the data frame is occupied by Andre De Grasse?
# Find
idx = df[df['Athlete'] == 'Andre De Grasse'].index
print(f'As an index object: {idx}\nAs a list: {list(idx)}')
## As an index object: Index([2], dtype='int64')
## As a list: [2]
Now that we can find indices, let’s use them to look up data in the table:
Find a value in a column corresponding to a value in a different column. For example, what was Akani Simbine’s time in his heat?
# Look up
idx = df[df['Athlete'] == 'Akani Simbine'].index
value = df.loc[idx, 'Heat']
print(f"Akani Simbine's time in his heat: {list(value)[0]}s")
## Akani Simbine's time in his heat: 10.14s
Instead of using indices, can search a data frame or column using any of the methods described above to generate Booleans. Those can then be used to index the data frame or column to leave only those that were ‘found in the search’. As an example, here’s how to find the non-American athletes who ran sub-10 and get their names:
# Search for values across multiple columns
results = (df['Country'] != 'United States') & (df['Final'] <= 10.00)
# Index a column with the results of the search
non_american_sub_10 = df['Athlete'][results]
print(list(non_american_sub_10))
## ['Usain Bolt', 'Andre De Grasse', 'Yohan Blake', 'Akani Simbine', 'Ben Youssef Meïté']
The previous methods can be used to find values in a single column,
but what about looking in the entire data frame? The
.stack()
method can be useful for this:
search_result = df[df.isin(['Akani Simbine'])].stack()
row = search_result.index[0][0]
col = search_result.index[0][1]
print(f'"Akani Simbine" is at index "{row}"" in column "{col}"')
## "Akani Simbine" is at index "4"" in column "Athlete"
This method is discussed more fully on the page about multi-level data frames.
Filtering removes all rows whose values do not meet certain criteria.
As an example, here is the data of the United States athletes only:
# Filter
subset = df[df['Country'] == 'United States']
print(subset)
## Athlete Country Age Heat Semi-Final Final Sub-10
## 1 Justin Gatlin United States 34 10.01 9.94 9.89 True
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06 False
…and of the non-United States athletes only:
# Filter
subset = df[df['Country'] != 'United States']
print(subset)
## Athlete Country Age Heat Semi-Final Final Sub-10
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81 True
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91 True
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93 True
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94 True
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96 True
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04 False
Let’s take another look at the above code. It starts off as a search:
df['Country'] == 'United States'
will search for all rows
in the “Country” column that are equal to “United States” and return a
series of Booleans (True, False, etc, corresponding to
the rows in which the “United States” text was found). That series of
Booleans is then used to index the data frame, and only
the rows corresponding to the “Trues” in the Boolean series ‘survive’
the indexing.
Instead of matching only one criteria, here is how to filter out all rows that do not match either of two criteria: what is the data for athletes who came from either the United States or South Africa?
# Filter
subset = df[df['Country'].isin(['United States', 'South Africa'])]
print(subset)
## Athlete Country Age Heat Semi-Final Final Sub-10
## 1 Justin Gatlin United States 34 10.01 9.94 9.89 True
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94 True
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06 False
…and all athletes who came from neither the US or SA:
# Filter
subset = df[~df['Country'].isin(['United States', 'South Africa'])]
print(subset)
## Athlete Country Age Heat Semi-Final Final Sub-10
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81 True
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91 True
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93 True
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96 True
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04 False
Using ‘and not’ logic:
subset = df[
df['Country'].isin(['United States', 'South Africa', 'Jamaica']) & ~
df['Age'].isin([26, 34])
]
print(subset)
## Athlete Country Age Heat Semi-Final Final Sub-10
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81 True
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94 True
## 7 Trayvon Bromell United States 21 10.13 10.01 10.06 False
Using ‘or’ logic:
subset = df[
df['Country'].isin(['South Africa']) |
df['Age'].isin([26, 34])
]
print(subset)
## Athlete Country Age Heat Semi-Final Final Sub-10
## 1 Justin Gatlin United States 34 10.01 9.94 9.89 True
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93 True
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94 True
Of course, all the above filtering examples work with both strings and numbers as the criteria.
If one of the columns consists of Booleans you can filter by it. In other words, all the rows with ‘False’ get removed and all the ones with ‘True’ remain:
# Filter by who ran sub-10 seconds
subset = df[df['Sub-10']]
print(subset)
## Athlete Country Age Heat Semi-Final Final Sub-10
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81 True
## 1 Justin Gatlin United States 34 10.01 9.94 9.89 True
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91 True
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93 True
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94 True
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96 True
Use the .between()
method to find the values that are
between two numbers, eg to find the athletes who are between 25 and 30
years of age. Use the inclusive
keyword argument to specify
if you want to include the athletes who are exactly 25 or 30 or not:
left
will include the athletes who are 25 - but not
those who are 30 - and all ages in-betweenright
will include the athletes who are 30 - but not
those who are 25 - and all ages in-betweenboth
will include the athletes who are 25, 30 and
in-betweenneither
will exclude the athletes aged 25 or 30# Get the athletes aged between 25 and 30 inclusive
subset = df[df['Age'].between(25, 30, inclusive='both')]
print(subset)
## Athlete Country Age Heat Semi-Final Final Sub-10
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81 True
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93 True
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96 True
As mentioned above, you can use multiple criteria when filtering. So you can get the athletes who are aged between 25 and 30 and who ran between 9.90 and 10.00 seconds in the final:
# Get the athletes aged between 25 and 30 inclusive who ran between 9.90 and 10.00 in the final
subset = df[
df['Age'].between(25, 30, inclusive='both') &
df['Final'].between(9.90, 10.00, inclusive='both')
]
print(subset)
## Athlete Country Age Heat Semi-Final Final Sub-10
## 3 Yohan Blake Jamaica 26 10.11 10.01 9.93 True
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96 True
Filter out all rows that have a null value in a particular column:
subset = df[df['Age'].notnull()]
Filter out all rows that contain any null values in any columns:
subset = df.dropna()
Filter out all rows that have only null values in particular columns:
subset = df.dropna(how='all', subset=['Country', 'Age'])
Remove all rows that appear more than once:
subset = df.drop_duplicates()
Remove rows where a value appears twice in a particular column, keeping the row where that value first appears:
# Remove all repeat instances of athletes from the same country
subset = df.drop_duplicates('Country', keep='first')
# This removes Yohan Blake and Trayvon Bromell
print(subset)
## Athlete Country Age Heat Semi-Final Final Sub-10
## 0 Usain Bolt Jamaica 29 10.07 9.86 9.81 True
## 1 Justin Gatlin United States 34 10.01 9.94 9.89 True
## 2 Andre De Grasse Canada 21 10.04 9.92 9.91 True
## 4 Akani Simbine South Africa 22 10.14 9.98 9.94 True
## 5 Ben Youssef Meïté Ivory Coast 29 10.03 9.97 9.96 True
## 6 Jimmy Vicaut France 24 10.19 9.95 10.04 False
As discussed on the page about lists and sets, a set will remove duplicate values from a list but not preserve their order:
values = ['One', 'Two', 'Three', 'Three']
unique_values = list(set(values))
print(unique_values)
## ['Two', 'One', 'Three']
We would want the result to be ['One', 'Two', 'Three']
,
not ['Two', 'One', 'Three']
. By using a Pandas Series
object, however, we can get this:
values = ['One', 'Two', 'Three', 'Three']
unique_values = pd.Series(values).drop_duplicates().tolist()
print(unique_values)
## ['One', 'Two', 'Three']