
1 Python Packages

Text can be extracted from PDF documents using the pypdf package (read more on PyPI, GitHub and Read the Docs). We will also be using the Pandas library. Install both from the terminal with the following commands:

# "python3.12" should correspond to the version of Python you are using
python3.12 -m pip install pypdf
python3.12 -m pip install pandas

Once finished, you can import them into a Python script as follows:

import pypdf
import pandas as pd

2 Example

As an example of a PDF from which to extract text, let’s use Bland and Altman’s 1986 paper “Statistical methods for assessing agreement between two methods of clinical measurement” [1]. This is the 29th most-cited paper of all time, so it’s quite a popular one! It can be downloaded from https://www-users.york.ac.uk/~mb55/meas/ba.pdf, which is part of Bland’s own site.

Once you’ve downloaded the PDF, move it into the same folder as your Python script and you’ll be able to use pypdf from there:

# Download this PDF from https://www-users.york.ac.uk/~mb55/meas/ba.pdf
pdf = pypdf.PdfReader('ba.pdf')

Once the PDF has been read into Python, its pages are available through the `pages` attribute, and the number of pages is the length of that attribute:

number_of_pages = len(pdf.pages)
print(f'Number of pages in this PDF: {number_of_pages}')
## Number of pages in this PDF: 9
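
Because `pdf.pages` behaves like a sequence, it can also be looped over directly. The following is a minimal sketch rather than part of the original walkthrough: the helper name `extract_lines_per_page` is made up for illustration, and it relies on the `extract_text()` page method that is introduced in the next step.

```python
def extract_lines_per_page(reader):
    """Return one list of text lines for each page of a PDF.

    `reader` is assumed to behave like a pypdf.PdfReader: it exposes a
    `pages` sequence whose items have an `extract_text()` method.
    """
    lines_per_page = []
    for page in reader.pages:
        # Treat a page with no extractable text as an empty string
        text = page.extract_text() or ''
        lines_per_page.append(text.split('\n'))
    return lines_per_page


# Example usage (assumes ba.pdf is in the same folder as this script):
# import pypdf
# lines_per_page = extract_lines_per_page(pypdf.PdfReader('ba.pdf'))
# print(len(lines_per_page))
```

Keeping the lines grouped by page makes it easy to go back to a single page later, for example the one that holds a table.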

Get the first of these pages, extract the text and split it into individual lines:

page = pdf.pages[0]
text = page.extract_text()
lines = text.split('\n')
for line in lines[:3]:
    print(line)
## 1STATISTICAL METHODS FOR ASSESSING AGREEMENT
## BETWEEN TWO METHODS OF CLINICAL MEASUREMENT
## J. Martin Bland, Douglas G. Altman

You’ll notice that the page number appears at the start of the first line. This is annoying, but it’s a manageable glitch.
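
One way to manage it is to strip the leading digits from the line. This is a minimal sketch, assuming the page number is always fused directly onto text that begins with a capital letter (true for the first page here, but not checked against other PDFs). `strip_page_number` is a hypothetical helper, not a pypdf function:

```python
import re


def strip_page_number(line):
    """Remove digits fused onto the start of a line of text.

    Only strips when the digits are immediately followed by a capital
    letter, so lines that are genuinely numeric data are left alone.
    """
    return re.sub(r'^\d+(?=[A-Z])', '', line)


first_line = '1STATISTICAL METHODS FOR ASSESSING AGREEMENT'
print(strip_page_number(first_line))
## STATISTICAL METHODS FOR ASSESSING AGREEMENT
```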

Next, let’s access the raw data on page 2:

page = pdf.pages[1]
text = page.extract_text()
lines = text.split('\n')
for line in lines[1:21]:
    print(line)
##          Wright peak flow meter    Mini Wright peak flow meter
##          First PEFR  Second PEFR   First PEFR  Second PEFR
## Subject   (l/min)      (l/mi)       (l/min)      (l/min)
##  1         494          490          512          525
##  2         395          397          430          415
##  3         516          512          520          508
##  4         434          401          428          444
##  5         476          470          500          500
##  6         557          611          600          625
##  7         413          415          364          460
##  8         442          431          380          390
##  9         650          638          658          642
## 10         433          429          445          432
## 11         417          420          432          420
## 12         656          633          626          605
## 13         267          275          260          227
## 14         478          492          477          467
## 15         178          165          259          268
## 16         423          372          350          370
## 17         427          421          451          443

This is great! We can import the data into Python directly from the PDF! Of course, it will be more usable as a data frame, so let’s use Pandas to restructure it:

# Column names as a MultiIndex: two meters with two readings each
arrays = [
    ['Wright'] * 2 + ['Mini Wright'] * 2,
    ['First', 'Second'] * 2,
]
columns = pd.MultiIndex.from_tuples(list(zip(*arrays)))
# Collect one single-row data frame per subject
rows = []
# Iterate over the lines extracted from the PDF
for line in lines[4:21]:
    # Split on runs of white space: subject number, then four readings
    fields = line.split()
    # Construct a new row, indexed by the subject number
    new_row = pd.DataFrame([fields[1:5]], index=[fields[0]], columns=columns)
    rows.append(new_row)
# Combine the rows into the master data frame
df = pd.concat(rows)
print(df)
##    Wright        Mini Wright       
##     First Second       First Second
## 1     494    490         512    525
## 2     395    397         430    415
## 3     516    512         520    508
## 4     434    401         428    444
## 5     476    470         500    500
## 6     557    611         600    625
## 7     413    415         364    460
## 8     442    431         380    390
## 9     650    638         658    642
## 10    433    429         445    432
## 11    417    420         432    420
## 12    656    633         626    605
## 13    267    275         260    227
## 14    478    492         477    467
## 15    178    165         259    268
## 16    423    372         350    370
## 17    427    421         451    443
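
Note that the values in this data frame are still text, because they came from splitting strings. One possible alternative is sketched below: it picks out the data rows by pattern (five whole numbers per line) rather than by position, converts them to integers and builds the data frame in a single step. The sample lines are copied from the extracted output shown earlier so that the snippet runs without the PDF, and the names `sample_lines` and `row_pattern` are illustrative only.

```python
import re

import pandas as pd

# A few lines as extracted from page 2 (copied from the output shown earlier)
sample_lines = [
    '         First PEFR  Second PEFR   First PEFR  Second PEFR',
    'Subject   (l/min)      (l/mi)       (l/min)      (l/min)',
    ' 1         494          490          512          525',
    ' 2         395          397          430          415',
    ' 3         516          512          520          508',
]

# Assumption: a data row is exactly five whole numbers separated by spaces
row_pattern = re.compile(r'\s*\d+(\s+\d+){4}\s*')

subjects = []
readings = []
for line in sample_lines:
    if row_pattern.fullmatch(line):
        # Convert every field to an integer: subject first, then readings
        fields = [int(field) for field in line.split()]
        subjects.append(fields[0])
        readings.append(fields[1:])

columns = pd.MultiIndex.from_tuples([
    ('Wright', 'First'), ('Wright', 'Second'),
    ('Mini Wright', 'First'), ('Mini Wright', 'Second'),
])
df = pd.DataFrame(readings, index=subjects, columns=columns)
print(df)
```

With integer values, arithmetic such as the difference between the two meters works directly, and matching on a pattern avoids hard-coding where the table starts and ends. The pattern would need adjusting for tables that contain decimals or missing values.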

See here for how to continue the analysis of Bland and Altman’s data.



  1. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327(8476):307–310. DOI: 10.1016/s0140-6736(86)90837-8. PMID: 2868172.