Text can be extracted from PDF documents using the pypdf package (read more on PyPI, GitHub and Read the Docs). We will also be using the Pandas library. Install both from the terminal with the following commands:
# "python3.12" should correspond to the version of Python you are using
python3.12 -m pip install pypdf
python3.12 -m pip install pandas
Once finished, you can import them into a Python script as follows:
import pypdf
import pandas as pd
As an example of a PDF from which to extract text, let’s use Bland and Altman’s 1986 paper “Statistical methods for assessing agreement between two methods of clinical measurement” [1]. This is the 29th most-cited paper of all time, so it’s quite a popular one! It can be downloaded from https://www-users.york.ac.uk/~mb55/meas/ba.pdf, which is part of Bland’s own site.
Once you’ve downloaded the PDF, move it into the same folder as your Python script and you’ll be able to use pypdf from there:
# Download this PDF from https://www-users.york.ac.uk/~mb55/meas/ba.pdf
pdf = pypdf.PdfReader('ba.pdf')
Once the PDF has been loaded into Python, the number of pages can be found by taking the length of the reader’s pages attribute:
number_of_pages = len(pdf.pages)
print(f'Number of pages in this PDF: {number_of_pages}')
## Number of pages in this PDF: 9
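Because the pages can be counted and indexed like a list, the same extraction can be repeated for every page in a loop. The following is a minimal sketch, assuming only that the reader object exposes `pages` and that each page has an `extract_text()` method (as pypdf's `PdfReader` does). The helper name `extract_lines_per_page` is hypothetical and not part of pypdf:

```python
def extract_lines_per_page(reader):
    """Return one list of text lines for each page of a PDF reader object."""
    lines_per_page = []
    for page in reader.pages:
        # Guard against a page that yields no text at all
        text = page.extract_text() or ''
        lines_per_page.append(text.split('\n'))
    return lines_per_page
```

With the reader created earlier, `extract_lines_per_page(pdf)[0][:3]` would give the first three lines of the first page.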
Get the first of these pages, extract the text and split it into individual lines:
page = pdf.pages[0]
text = page.extract_text()
lines = text.split('\n')
for line in lines[:3]:
    print(line)
## 1STATISTICAL METHODS FOR ASSESSING AGREEMENT
## BETWEEN TWO METHODS OF CLINICAL MEASUREMENT
## J. Martin Bland, Douglas G. Altman
You’ll notice that the page number appears at the start of the first line. This is annoying, but it’s a manageable glitch.
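One possible way to tidy this up is with a regular expression that strips digits fused directly onto the start of a word. This is only a sketch: the helper name `strip_leading_page_number` is hypothetical, and the pattern assumes that a genuine line never starts with a digit immediately followed by a letter (a line such as "3D printing" would be damaged by it).

```python
import re


def strip_leading_page_number(line):
    """Remove digits fused onto the start of a line of extracted text."""
    # Only strip digits that are immediately followed by a letter, so that
    # rows of data that start with a number and a space are left alone
    return re.sub(r'^\d+(?=[A-Za-z])', '', line)


first_line = '1STATISTICAL METHODS FOR ASSESSING AGREEMENT'
cleaned = strip_leading_page_number(first_line)
print(cleaned)
```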
Next, let’s access the raw data on page 2:
page = pdf.pages[1]
text = page.extract_text()
lines = text.split('\n')
for line in lines[1:21]:
    print(line)
## Wright peak flow meter Mini Wright peak flow meter
## First PEFR Second PEFR First PEFR Second PEFR
## Subject (l/min) (l/mi) (l/min) (l/min)
## 1 494 490 512 525
## 2 395 397 430 415
## 3 516 512 520 508
## 4 434 401 428 444
## 5 476 470 500 500
## 6 557 611 600 625
## 7 413 415 364 460
## 8 442 431 380 390
## 9 650 638 658 642
## 10 433 429 445 432
## 11 417 420 432 420
## 12 656 633 626 605
## 13 267 275 260 227
## 14 478 492 477 467
## 15 178 165 259 268
## 16 423 372 350 370
## 17 427 421 451 443
This is great! We can import the data into Python directly from the PDF! Of course, it will be more usable in a data frame format so let’s use Pandas and re-structure it:
# Initialise the output data frame
df = pd.DataFrame()
# Iterate over the lines extracted from the PDF
for line in lines[4:21]:
    # Remove duplicate white space
    line = ' '.join(line.split())
    # Split into its elements
    line = line.split()
    # Construct a new row with a MultiIndex as the column names
    data = [[line[1], line[2], line[3], line[4]]]
    arrays = [
        ['Wright'] * 2 + ['Mini Wright'] * 2,
        ['First', 'Second'] * 2,
    ]
    tuples = list(zip(*arrays))
    columns = pd.MultiIndex.from_tuples(tuples)
    new_row = pd.DataFrame(data, index=[line[0]], columns=columns)
    # Add to master data frame
    df = pd.concat([df, new_row])
print(df)
##     Wright          Mini Wright
##      First  Second        First  Second
## 1      494     490          512     525
## 2      395     397          430     415
## 3      516     512          520     508
## 4      434     401          428     444
## 5      476     470          500     500
## 6      557     611          600     625
## 7      413     415          364     460
## 8      442     431          380     390
## 9      650     638          658     642
## 10     433     429          445     432
## 11     417     420          432     420
## 12     656     633          626     605
## 13     267     275          260     227
## 14     478     492          477     467
## 15     178     165          259     268
## 16     423     372          350     370
## 17     427     421          451     443
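Note that the values in the data frame above are still strings, because they came straight from splitting lines of text. As an alternative sketch (not the approach used above), the rows can be picked out with a regular expression, converted to integers and passed to Pandas in one go. The sample lines below are copied from the extracted text shown earlier, while the names `row_pattern` and `df_numeric` and the index name 'Subject' are choices made for this illustration:

```python
import re

import pandas as pd

# A few lines copied from the extracted text (one header line, three subjects)
sample_lines = [
    'Subject (l/min) (l/mi) (l/min) (l/min)',
    '1 494 490 512 525',
    '2 395 397 430 415',
    '3 516 512 520 508',
]

# Keep only the lines that consist of exactly five whole numbers
row_pattern = re.compile(r'^\s*\d+(\s+\d+){4}\s*$')
rows = [
    [int(value) for value in line.split()]
    for line in sample_lines
    if row_pattern.match(line)
]

# Build the two-level column names once, then the whole data frame at once
columns = pd.MultiIndex.from_product(
    [['Wright', 'Mini Wright'], ['First', 'Second']]
)
df_numeric = pd.DataFrame(
    [row[1:] for row in rows],
    index=pd.Index([row[0] for row in rows], name='Subject'),
    columns=columns,
)
print(df_numeric.dtypes)
```

Filtering by pattern avoids hard-coding which line numbers hold the data, and integer columns allow calculations such as `df_numeric['Wright'].mean()` to work directly.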
See here for how to continue the analysis of Bland and Altman’s data.
[1] Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327(8476):307–310. DOI: 10.1016/s0140-6736(86)90837-8. PMID: 2868172.