VLOOKUP is a function that is used all the time in spreadsheet programmes such as Microsoft Excel or LibreOffice Calc. There is no one-line equivalent in Python’s Pandas but the same thing can be achieved in two lines by filtering then accessing:
# VLOOKUP in Python's Pandas
idx = df[df[column_name_1] == criterion].index
value = df.loc[idx, column_name_2]
In this tutorial we’ll create a custom function that mimics the format of Excel’s VLOOKUP and use it to unblind a dataset from a blinded experiment.
We’re going to use scikit-learn’s ‘Diabetes’ toy dataset. You can read about it over here but for the purpose of this tutorial all you need to know is that it contains the medical data of participants in a study. It can be downloaded from the internet using Pandas:
import pandas as pd
# Download the dataset from the internet as tab-separated variables
url = 'https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt'
df = pd.read_csv(url, sep='\t')
Let’s create participant IDs from the data frame’s index:
# Create participant numbers
df['ID'] = 'Participant ' + (df.index + 1).astype(str)
Now we can take a look at it:
print(df.head())
## AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y ID
## 0 59 2 32.1 101.0 157 93.2 38.0 4.0 4.8598 87 151 Participant 1
## 1 48 1 21.6 87.0 183 103.2 70.0 3.0 3.8918 69 75 Participant 2
## 2 72 2 30.5 93.0 156 93.6 41.0 4.0 4.6728 85 141 Participant 3
## 3 24 1 25.3 84.0 198 131.4 40.0 5.0 4.8903 89 206 Participant 4
## 4 50 1 23.0 101.0 192 125.4 52.0 4.0 4.2905 80 135 Participant 5
As you can see, we have information and measurements for each person but do not know their names. To unblind the data we need a ‘key’ that links the participants’ IDs to their real names.
Create a data frame that contains both the participants’ blinded IDs and their real names. In a real-life scenario, this might be a spreadsheet that you import into Python using read_csv()
or read_excel()
but for this example we will create it manually. I’m not creative enough to come up with real-sounding names so I’ll use a package called names
to randomly generate them for me. You can download this for yourself by running the following from the terminal (replace 3.10
with the version of Python you are using):
python3.10 -m pip install names
To ensure we get the same random names each time we run this code, we need to set the random package’s ‘seed’ to a fixed value:
import names
import random
# Get the same random names every time
random.seed(20220824)
# Create a data frame from a dictionary
dct = {
'blinded_name': ['Participant 1', 'Participant 2', 'Participant 3', 'Participant 4', 'Participant 5'],
'unblinded_name': [
names.get_full_name(gender='female'),
names.get_full_name(gender='male'),
names.get_full_name(gender='female'),
names.get_full_name(gender='male'),
names.get_full_name(gender='male'),
],
}
key = pd.DataFrame(dct)
print(key)
## blinded_name unblinded_name
## 0 Participant 1 Gina Contreras
## 1 Participant 2 Dale Casanova
## 2 Participant 3 Jill Cooper
## 3 Participant 4 Donald Ballenger
## 4 Participant 5 Martin Hogarth
In Excel, Calc or Google Sheets, the VLOOKUP function takes three inputs:
We can replicate this as follows:
def vlookup(search_criterion, array, index):
"""
Perform a VLOOKUP operation.
This uses Microsoft Excel's/LibreOffice Calc's format for the VLOOKUP function.
"""
# Lookup patient IDs in column 0
idx = array[array.iloc[:, 0] == search_criterion].index
# Lookup corresponding values in column 1
try:
value = array.iloc[idx, index].values[0]
except IndexError:
value = None
return value
In our example, we want to search the first column of the key
data frame to find the participants’ IDs, then look up the corresponding real names in the second column (which is column ‘1’ because Python is zero-indexed). To do this, we apply our vlookup()
function to the participants’ IDs, which automatically causes these IDs to be interpreted as the search criteria. We then need to specify the remaining two arguments: the key
data frame as the array to look at and 1
as the index of the column within key
that contains the data we are looking for:
df['NAME'] = df['ID'].apply(vlookup, args=(key, 1))
print(df.head())
## AGE SEX BMI BP S1 S2 S3 S4 S5 S6 Y ID NAME
## 0 59 2 32.1 101.0 157 93.2 38.0 4.0 4.8598 87 151 Participant 1 Gina Contreras
## 1 48 1 21.6 87.0 183 103.2 70.0 3.0 3.8918 69 75 Participant 2 Dale Casanova
## 2 72 2 30.5 93.0 156 93.6 41.0 4.0 4.6728 85 141 Participant 3 Jill Cooper
## 3 24 1 25.3 84.0 198 131.4 40.0 5.0 4.8903 89 206 Participant 4 Donald Ballenger
## 4 50 1 23.0 101.0 192 125.4 52.0 4.0 4.2905 80 135 Participant 5 Martin Hogarth
We now know the name of the person each row of data came from.