Sometimes, you’ll come across datasets where certain columns contain only the values “False”, “None”, “NaN” or “0” and you’ll want to remove them. This isn’t entirely straightforward to do, so this page explores a few options at your disposal.
Just to be clear, this page is about removing entire columns that contain only data that isn’t useful. For removing columns or rows that partially contain missing data, check out the .dropna() method.
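As a quick illustration of that alternative (a minimal sketch with a throwaway data frame, separate from the example used below), .dropna() with its default settings removes the rows that contain at least one missing value:
import pandas as pd
import numpy as np
# A small data frame with some missing values
partial = pd.DataFrame({'a': [1, 2, np.nan], 'b': [4, None, 6]})
# Drop the rows that contain at least one missing value
print(partial.dropna())
##      a    b
## 0  1.0  4.0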
Here’s the data frame that will be used for the examples on this page:
import pandas as pd
import numpy as np
# Set Pandas display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)
# Create a data frame
raw = pd.DataFrame({
'only_text': ['Value 1', 'Value 2', 'Value 3', 'Value 4'],
'only_trues': [True, True, True, True],
'text_and_nones': ['Value 1', 'Value 2', None, None],
'only_falses': [False, False, False, False],
'only_nones': [None, None, None, None],
'only_nans': [np.nan, np.nan, np.nan, np.nan],
'only_zeroes': [0, 0, 0, 0],
'falses_and_nones': [False, False, None, None],
'falses_and_nans': [False, False, np.nan, np.nan],
'falses_and_zeroes': [False, False, 0, 0],
'nones_and_nans': [None, None, np.nan, np.nan],
'nones_and_zeroes': [None, None, 0, 0],
'nans_and_zeroes': [np.nan, np.nan, 0, 0],
}, dtype='object')
print(raw)
## only_text only_trues text_and_nones only_falses only_nones only_nans only_zeroes falses_and_nones falses_and_nans \
## 0 Value 1 True Value 1 False None NaN 0 False False
## 1 Value 2 True Value 2 False None NaN 0 False False
## 2 Value 3 True None False None NaN 0 None NaN
## 3 Value 4 True None False None NaN 0 None NaN
##
## falses_and_zeroes nones_and_nans nones_and_zeroes nans_and_zeroes
## 0 False None None NaN
## 1 False None None NaN
## 2 0 NaN 0 0
## 3 0 NaN 0 0
The above code uses the Pandas and Numpy packages. Install these from the terminal with:
python3.11 -m pip install pandas
python3.11 -m pip install numpy
Where python3.11 is the version of Python you are using (in my case it’s 3.11, but replace this with whatever you have).
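If you’re not sure which versions of these packages you have, you can check from within Python:
import pandas as pd
import numpy as np
# Print the installed package versions
print(pd.__version__)
print(np.__version__)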
As you can see, the example data frame includes columns that have all the different combinations of data that you might want to remove - Falses, Nones, NaNs and zeroes - as well as a few columns at the start that don’t only contain data to remove.
To find the columns whose elements are all “False”, use the:
- .any() method to return whether any elements in a column are “True”
- tilde (~) operator to get whether all elements in a column are “False”
- bool_only=True keyword argument to only look at Boolean values
The columns that are found can then be removed with the .drop() method:
# Make a copy of the data frame
df = raw.copy()
# Find which columns contain only Booleans and only Falses. Use the `bool_only=True`
# option with the `any()` method to have it only recognise Booleans:
bool_cols = ~df.any(bool_only=True)
only_falses = bool_cols[bool_cols]
# Get the names of these columns
cols_only_falses = only_falses.index.to_list()
# Remove these columns
df = df.drop(cols_only_falses, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_falses'}
The only_falses column is the only one to have been dropped.
The same output can be achieved using a loop:
# Make a copy of the data frame
df = raw.copy()
# Drop the columns that only contain 'False'
for col in list(df.any(bool_only=True)[~df.any(bool_only=True)].index):
df = df.drop(col, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_falses'}
Column only_falses has been dropped while all the others remain.
The secret to finding the values that are “None” as opposed to “NaN” is to check their data types, which can be done by applying the type function as a map across the entire data frame with the .applymap() method (note that in Pandas 2.1 and later this method has been renamed to .map() for data frames, although .applymap() still works with a deprecation warning):
# Make a copy of the data frame
df = raw.copy()
# Find the nones
nones = df.applymap(type) == type(None)
# Find columns with only nones
cols = nones.all()[nones.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_nones'}
The only_nones column has been dropped while all the others have been left alone.
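The reason this works is that “None” has its own type (NoneType) whereas “NaN” is a floating-point number, so comparing each element’s type against type(None) picks out the Nones and nothing else:
# 'None' is of type NoneType while 'NaN' is a float
print(type(None))
## <class 'NoneType'>
print(type(np.nan))
## <class 'float'>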
We need to find the values that are “NA” using the .isna() method, but this will return values that are “None” as well. These can be removed by doing the opposite of what we did above: using the .applymap() method to look for values that do not have their type equal to “None”:
# Make a copy of the data frame
df = raw.copy()
# Find the nans
nans = (df.isna()) & (df.applymap(type) != type(None))
# Find columns with only nans
cols = nans.all()[nans.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_nans'}
only_nans is the only column to be removed.
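To see why the extra type check was needed, note that .isna() flags both “None” and “NaN” as missing - here’s a quick check:
# .isna() treats both None and NaN as missing
print(pd.Series([None, np.nan, 'text'], dtype='object').isna().to_list())
## [True, True, False]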
We can search the data frame for values that are equal to “0”, but this will return values that are “False” as well! Remove those by using .applymap() once again, this time to find values that are integers:
# Make a copy of the data frame
df = raw.copy()
# Find the zeroes
zeroes = (df == 0) & (df.applymap(type) == int)
# Find columns with only zeroes
cols = zeroes.all()[zeroes.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_zeroes'}
We have removed the only_zeroes column and left the rest.
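The integer type check is needed because “False” compares as equal to “0” in Python even though their types differ - here’s a quick check:
# False is equal to 0...
print(0 == False)
## True
# ...but its type is bool rather than int
print(type(0) == int, type(False) == int)
## True False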
Find the Falses by looking for values that are equal to “False” but whose string representation is not “0” (because zeroes also compare as equal to “False”). Find the Nones by searching for values of the same type as “None”. Then remove the columns that only contain one or a combination of both of these:
# Make a copy of the data frame
df = raw.copy()
# Find the falses
falses = (df == False) & (df.astype(str) != '0')
# Find the nones
nones = df.applymap(type) == type(None)
# Find both the falses and the nones
falses_nones = falses | nones
# Find columns with only falses and nones
cols = falses_nones.all()[falses_nones.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'falses_and_nones', 'only_falses', 'only_nones'}
Find the Falses and the NaNs as has been described above:
# Make a copy of the data frame
df = raw.copy()
# Find the falses
falses = (df == False) & (df.astype(str) != '0')
# Find the nans
nans = (df.isna()) & (df.applymap(type) != type(None))
# Find both the falses and the nans
falses_nans = falses | nans
# Find columns with only falses and nans
cols = falses_nans.all()[falses_nans.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'falses_and_nans', 'only_falses', 'only_nans'}
The columns that contain only Falses, NaNs or a combination of the two - only_falses, only_nans and falses_and_nans - have been removed.
Because “False” and “0” compare as equal in Python, we can simply search for columns where all the values are equal to “0” using the .all() method:
# Make a copy of the data frame
df = raw.copy()
# Find columns that only contain falses and zeroes
only_falses_zeroes = (df == 0).all()
falses_zeroes_cols = only_falses_zeroes[only_falses_zeroes].index.to_list()
# Remove these columns, which contain only falses and zeroes
df = df.drop(falses_zeroes_cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_zeroes', 'only_falses', 'falses_and_zeroes'}
The columns that contain only Falses, zeroes or a combination of the two - only_falses, only_zeroes and falses_and_zeroes - have been removed.
Because “None” and “NaN” are both considered to be ‘Nulls’, we can simply use the .isnull() method to search for them and the .all() method to search for the columns containing only them:
# Make a copy of the data frame
df = raw.copy()
# Drop the columns that only contain Nones and/or NaNs
for col in df:
if df[col].isnull().all():
df = df.drop(col, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_nones', 'only_nans', 'nones_and_nans'}
Columns only_nones, only_nans and nones_and_nans have been dropped.
Equivalently, the .dropna() method can be used to drop ‘NAs’. Specifically, it can drop columns where all the values are NAs:
# Make a copy of the data frame
df = raw.copy()
# Drop the columns that only contain 'None' and/or 'NaN'
df = df.dropna(axis='columns', how='all')
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_nones', 'only_nans', 'nones_and_nans'}
In general, the .dropna() method is used to remove missing data (which means “None”s and “NaN”s). It has a how keyword argument which determines what gets removed: in our case we want it to remove columns where all the values are NAs (“None”s and “NaN”s).
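For comparison, how='any' would drop every column that contains even a single missing value, which is far more aggressive than what we want here. A quick sketch using the same raw data frame:
# Make a copy of the data frame
df = raw.copy()
# Drop every column that contains at least one 'None' or 'NaN'
df = df.dropna(axis='columns', how='any')
# Only the columns with no missing values at all are left
print(list(df))
## ['only_text', 'only_trues', 'only_falses', 'only_zeroes', 'falses_and_zeroes']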
Find the Nones by applying the type function as a map to the data frame via the .applymap() method and searching for the values whose type matches that of None. Find the zeroes by looking for values that become “0” when converted to a string (this avoids the values that are equal to “False” which, usually, are considered to be the same as the number “0” by Python):
# Make a copy of the data frame
df = raw.copy()
# Find the nones
nones = df.applymap(type) == type(None)
# Find the zeroes
zeroes = df.astype(str) == '0'
# Find both the nones and the zeroes
nones_zeroes = nones | zeroes
# Find columns with only nones and zeroes
cols = nones_zeroes.all()[nones_zeroes.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_nones', 'only_zeroes', 'nones_and_zeroes'}
The columns with only Nones, zeroes or a combination thereof have been removed.
Find the values that are “NaN” or “None” by using the .isna() method, then exclude the Nones by doing the opposite of what we did above: using the .applymap() method to look for values that do not have their type equal to “None”. Find the zeroes as above:
# Make a copy of the data frame
df = raw.copy()
# Find the nans
nans = (df.isna()) & (df.applymap(type) != type(None))
# Find the zeroes
zeroes = df.astype(str) == '0'
# Find both the nans and the zeroes
nans_zeroes = nans | zeroes
# Find columns with only nans and zeroes
cols = nans_zeroes.all()[nans_zeroes.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'nans_and_zeroes', 'only_zeroes', 'only_nans'}
Find the Falses, Nones and NaNs using methods described above and combine them:
# Make a copy of the data frame
df = raw.copy()
# Find the falses
falses = (df == False) & (df.astype(str) != '0')
# Find the nones
nones = df.applymap(type) == type(None)
# Find the nans
nans = (df.isna()) & (df.applymap(type) != type(None))
# Combine the above
falses_nones_nans = falses | nones | nans
# Find columns with only falses, nones and nans
cols = falses_nones_nans.all()[falses_nones_nans.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_falses', 'falses_and_nones', 'nones_and_nans', 'falses_and_nans', 'only_nones', 'only_nans'}
All columns that contain only Falses, Nones, NaNs or a combination thereof have been removed.
Converting a copy of each column to ‘Boolean’ format will turn all instances of “None” and “0” into “False”, but instances of “NaN” will become “True”, so columns containing NaNs will not be dropped:
# Make a copy of the data frame
df = raw.copy()
# Drop the columns that only contain 'False', 'None' and/or '0'
for col in list(df):
if not df[col].astype('bool').any():
df = df.drop(col, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_falses', 'falses_and_nones', 'nones_and_zeroes', 'only_zeroes', 'only_nones', 'falses_and_zeroes'}
Columns only_falses, only_nones, only_zeroes, falses_and_nones, falses_and_zeroes and nones_and_zeroes have been dropped, but none that contain NaNs.
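The NaN columns survive because “NaN” is ‘truthy’ when converted to a Boolean, whereas “None” and “0” are not:
# None and 0 are falsy but NaN is truthy
print(bool(None), bool(0), bool(np.nan))
## False False True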
Find the Falses, NaNs and zeroes separately:
# Make a copy of the data frame
df = raw.copy()
# Find the falses
falses = (df == False) & (df.astype(str) != '0')
# Find the nans
nans = (df.isna()) & (df.applymap(type) != type(None))
# Find the zeroes
zeroes = df.astype(str) == '0'
# Combine the above
falses_nans_zeroes = falses | nans | zeroes
# Find columns with only falses, nans and zeroes
cols = falses_nans_zeroes.all()[falses_nans_zeroes.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'nans_and_zeroes', 'only_nans', 'only_falses', 'only_zeroes', 'falses_and_nans', 'falses_and_zeroes'}
Find the Nones, NaNs and zeroes separately:
# Make a copy of the data frame
df = raw.copy()
# Find the nones
nones = df.applymap(type) == type(None)
# Find the nans
nans = (df.isna()) & (df.applymap(type) != type(None))
# Find the zeroes
zeroes = df.astype(str) == '0'
# Combine the above
nones_nans_zeroes = nones | nans | zeroes
# Find columns with only nones, nans and zeroes
cols = nones_nans_zeroes.all()[nones_nans_zeroes.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'nans_and_zeroes', 'nones_and_zeroes', 'nones_and_nans', 'only_zeroes', 'only_nones', 'only_nans'}
No columns containing “False” have been removed.
To drop the columns that contain only “False”, “None”, “NaN”, “0” or any combination thereof, the .any() method is going to be the key: it looks at each column and reports whether any of its elements are ‘truthy’ (ie True, non-zero or non-empty; NaNs are skipped). Let’s take a look:
# Make a copy of the data frame
df = raw.copy()
print(df.any())
## only_text True
## only_trues True
## text_and_nones True
## only_falses False
## only_nones False
## only_nans False
## only_zeroes False
## falses_and_nones False
## falses_and_nans False
## falses_and_zeroes False
## nones_and_nans False
## nones_and_zeroes False
## nans_and_zeroes False
## dtype: bool
Of course, what we are actually interested in is the columns where none of the elements are truthy. This is the opposite of what the .any() method looks for, so we can use the tilde (~) operator which returns the opposite of what it was given:
print(~df.any())
## only_text False
## only_trues False
## text_and_nones False
## only_falses True
## only_nones True
## only_nans True
## only_zeroes True
## falses_and_nones True
## falses_and_nans True
## falses_and_zeroes True
## nones_and_nans True
## nones_and_zeroes True
## nans_and_zeroes True
## dtype: bool
What we have is a series indicating which columns are empty! We can use the Boolean values (the True/False values) from this series to index this series, leaving us with only the columns that need removing:
print(~df.any()[~df.any()])
## only_falses True
## only_nones True
## only_nans True
## only_zeroes True
## falses_and_nones True
## falses_and_nans True
## falses_and_zeroes True
## nones_and_nans True
## nones_and_zeroes True
## nans_and_zeroes True
## dtype: bool
Now we can take the column names (which are the indices of this series) and drop them from the original data frame:
# Drop the columns that only contain 'False', 'None', 'NaN' and '0'
subset = df.drop(df.any()[~df.any()].index, axis=1)
print(subset)
## only_text only_trues text_and_nones
## 0 Value 1 True Value 1
## 1 Value 2 True Value 2
## 2 Value 3 True None
## 3 Value 4 True None
This is the result we were looking for: the original data frame minus the columns that only contain non-values.
The same output as above can be obtained by looping over the columns and testing them individually using the .any() method:
# Make a copy of the data frame
df = raw.copy()
# Drop the columns that only contain 'False', 'None', 'NaN' and/or '0'
for col in df:
if not df[col].any():
df = df.drop(col, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'nans_and_zeroes', 'only_nans', 'only_falses', 'falses_and_nones', 'nones_and_zeroes', 'nones_and_nans', 'only_zeroes', 'falses_and_nans', 'only_nones', 'falses_and_zeroes'}
# Show the output
print(df)
## only_text only_trues text_and_nones
## 0 Value 1 True Value 1
## 1 Value 2 True Value 2
## 2 Value 3 True None
## 3 Value 4 True None
If there is a column that is of the ‘category’ type then the above methods do not always work as expected. To demonstrate, let’s convert one of the columns to this type:
# Change the type of the falses_and_nones column
raw['falses_and_nones'] = raw['falses_and_nones'].astype('category')
print(raw)
## only_text only_trues text_and_nones only_falses only_nones only_nans only_zeroes falses_and_nones falses_and_nans \
## 0 Value 1 True Value 1 False None NaN 0 False False
## 1 Value 2 True Value 2 False None NaN 0 False False
## 2 Value 3 True None False None NaN 0 NaN NaN
## 3 Value 4 True None False None NaN 0 NaN NaN
##
## falses_and_zeroes nones_and_nans nones_and_zeroes nans_and_zeroes
## 0 False None None NaN
## 1 False None None NaN
## 2 0 NaN 0 0
## 3 0 NaN 0 0
Note that the “None” values in the falses_and_nones column have been changed to “NaN”s by the re-typing process.
Now, let’s attempt the first method we used above to drop columns containing only “False”, “None”, “NaN” and “0” (ie using df.any()):
# Make a copy of the data frame
df = raw.copy()
# Drop the columns that only contain 'False', 'None', 'NaN' and '0'
df = df.drop(df.any()[~df.any()].index, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'falses_and_zeroes', 'nones_and_zeroes', 'falses_and_nans', 'nans_and_zeroes', 'only_nans', 'only_zeroes', 'only_nones', 'only_falses', 'nones_and_nans'}
This has not worked as expected because falses_and_nones (the ‘category’ column) has not been removed!
Using .fillna() to fill all “None” and “NaN” values with “False” does not work either, regardless of whether you use bool_only=True or not:
# Make a copy of the data frame
df = raw.copy()
# Drop the columns that only contain 'False', 'None', 'NaN' and '0'
df = df.fillna(False)
for col in list(df.any()[~df.any()].index):
df = df.drop(col, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_falses', 'falses_and_nans', 'nones_and_nans', 'only_zeroes', 'nones_and_zeroes', 'only_nones', 'falses_and_zeroes', 'nans_and_zeroes', 'only_nans'}
# Make a copy of the data frame
df = raw.copy()
# Drop the columns that only contain 'False'
df = df.fillna(False)
for col in list(df.any(bool_only=True)[~df.any(bool_only=True)].index):
df = df.drop(col, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_falses', 'nones_and_nans', 'falses_and_nans', 'only_nones', 'only_nans'}
In both cases, falses_and_nones has not been removed.
Again, we will attempt a method that worked previously (the one for dropping columns that contain only Falses and Nones):
# Make a copy of the data frame
df = raw.copy()
# Find the falses
falses = (df == False) & (df.astype(str) != '0')
# Find the nones
nones = df.applymap(type) == type(None)
# Find both the falses and the nones
falses_nones = falses | nones
# Find columns with only falses and nones
cols = falses_nones.all()[falses_nones.all()].index.to_list()
# Drop these columns
df = df.drop(cols, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_nones', 'only_falses'}
…and again this has not worked as expected! The falses_and_nones column which we converted to the ‘category’ type is still not being removed.
Now let’s try the method that drops columns containing only “False” (ie using bool_only=True):
# Make a copy of the data frame
df = raw.copy()
# Find which columns contain only Booleans and only Falses. Use the `bool_only=True`
# option with the `any()` method to have it only recognise Booleans:
bool_cols = ~df.any(bool_only=True)
only_falses = bool_cols[bool_cols]
# Get the names of these columns
cols_only_falses = only_falses.index.to_list()
# Remove these columns
df = df.drop(cols_only_falses, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'only_falses'}
This has worked as expected!
Iterate through the columns and treat them differently depending on whether they are categorical or not:
# Make a copy of the data frame
df = raw.copy()
# Drop the columns that only contain 'False', 'None', 'NaN' and/or '0'
for col in list(df):
if str(df[col].dtype) != 'category':
# If the column is not categorical, check if it has any True values...
if not df[col].any():
# ...if not, remove it
df = df.drop(col, axis=1)
else:
# If the column is categorical, convert it to Boolean THEN check it...
if not df[col].astype('bool').any():
# ...if there are no True values, remove it
df = df.drop(col, axis=1)
# Which columns have been removed?
print(set(list(raw)) - set(list(df)))
## {'nans_and_zeroes', 'only_nans', 'only_falses', 'falses_and_nones', 'nones_and_zeroes', 'nones_and_nans', 'only_zeroes', 'falses_and_nans', 'only_nones', 'falses_and_zeroes'}
As expected, all columns that only contain “False”, “None”, “NaN” or “0” (or a combination thereof) have been removed.
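If this is something you need to do regularly, the per-column logic above can be wrapped into a small reusable function. This is only a sketch based on the loop above (the name drop_empty_columns is my own, not a Pandas method):
def drop_empty_columns(df):
    # Drop columns that contain only 'False', 'None', 'NaN' and/or '0'
    for col in list(df):
        if str(df[col].dtype) == 'category':
            # Categorical columns need to be converted to Boolean before checking
            empty = not df[col].astype('bool').any()
        else:
            empty = not df[col].any()
        if empty:
            df = df.drop(col, axis=1)
    return df

# Usage: only 'only_text', 'only_trues' and 'text_and_nones' should remain
df = drop_empty_columns(raw.copy())
print(list(df))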