2 Example

As an example of a PDF from which to extract text, let’s use Bland and Altman’s 1986 paper “Statistical methods for assessing agreement between two methods of clinical measurement”¹. This is the 29th most-cited paper of all time, so it’s quite a popular one! It can be downloaded from https://www-users.york.ac.uk/~mb55/meas/ba.pdf, which is part of Bland’s own site.

Once you’ve downloaded the PDF, move it into the same folder as your Python script and you’ll be able to use pypdf from there:

import pypdf

# Download this PDF from https://www-users.york.ac.uk/~mb55/meas/ba.pdf
pdf = pypdf.PdfReader('ba.pdf')
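
If you would rather fetch the file from within the script, the following is a minimal sketch that uses only the standard library. The helper name ensure_pdf is hypothetical (it is not part of pypdf), and the sketch assumes the URL in the comment above is still live:

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the comment in the snippet above
PDF_URL = 'https://www-users.york.ac.uk/~mb55/meas/ba.pdf'


def ensure_pdf(url, filename):
    """Download url to filename unless the file already exists.

    Hypothetical helper for illustration only.
    """
    path = Path(filename)
    if not path.exists():
        # urlretrieve writes the response body straight to disk
        urlretrieve(url, str(path))
    return path
```

Calling ensure_pdf(PDF_URL, 'ba.pdf') before creating the PdfReader would then download the paper only on the first run and reuse the local copy afterwards.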

Once the PDF has been read into Python, the number of pages it contains is the length of the reader’s pages attribute:

number_of_pages = len(pdf.pages)
print(f'Number of pages in this PDF: {number_of_pages}')

## Number of pages in this PDF: 9

Get the first of these pages, extract the text and split it into individual lines:

page = pdf.pages[0]
text = page.extract_text()
lines = text.split('\n')
for line in lines[:3]:
    print(line)

## 1STATISTICAL METHODS FOR ASSESSING AGREEMENT
## BETWEEN TWO METHODS OF CLINICAL MEASUREMENT
## J. Martin Bland, Douglas G. Altman

You’ll notice that the page number appears at the start of the first line. This is annoying, but it’s a manageable glitch.
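
One possible way to handle this is sketched below. It assumes that the printed page number equals the page’s position in the PDF (1 for pdf.pages[0]), which holds for the first page shown above but has not been checked for the rest. The function name strip_page_number is hypothetical:

```python
def strip_page_number(line, page_number):
    """Remove a leading page number from line if it is present."""
    prefix = str(page_number)
    if line.startswith(prefix):
        # Drop the number and any white space that follows it
        return line[len(prefix):].lstrip()
    return line


first_line = '1STATISTICAL METHODS FOR ASSESSING AGREEMENT'
print(strip_page_number(first_line, 1))
# prints: STATISTICAL METHODS FOR ASSESSING AGREEMENT
```

Matching the known page number, rather than stripping any leading digits, avoids damaging a line that legitimately starts with a number.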

Next, let’s access the raw data on page 2:

page = pdf.pages[1]
text = page.extract_text()
lines = text.split('\n')
for line in lines[1:21]:
    print(line)

##          Wright peak flow meter    Mini Wright peak flow meter
##          First PEFR  Second PEFR   First PEFR  Second PEFR
## Subject   (l/min)      (l/mi)       (l/min)      (l/min)
##  1         494          490          512          525
##  2         395          397          430          415
##  3         516          512          520          508
##  4         434          401          428          444
##  5         476          470          500          500
##  6         557          611          600          625
##  7         413          415          364          460
##  8         442          431          380          390
##  9         650          638          658          642
## 10         433          429          445          432
## 11         417          420          432          420
## 12         656          633          626          605
## 13         267          275          260          227
## 14         478          492          477          467
## 15         178          165          259          268
## 16         423          372          350          370
## 17         427          421          451          443
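
The output above mixes three header lines with 17 data rows. Rather than counting lines by hand, the data rows could be picked out by their shape. The sketch below assumes that every data row is exactly five whitespace-separated integers (the subject number followed by four readings). The names ROW_PATTERN and select_data_rows are hypothetical, and the sample lines are copied from the output above:

```python
import re

# A data row is assumed to be exactly five whitespace-separated integers:
# the subject number followed by four PEFR readings
ROW_PATTERN = re.compile(r'^\s*\d+(\s+\d+){4}\s*$')


def select_data_rows(lines):
    """Keep only the lines that look like rows of the table."""
    return [line for line in lines if ROW_PATTERN.match(line)]


sample = [
    'Subject   (l/min)      (l/mi)       (l/min)      (l/min)',
    ' 1         494          490          512          525',
    ' 2         395          397          430          415',
]
rows = select_data_rows(sample)
print(len(rows))  # prints: 2
```

With the real PDF, select_data_rows(lines) would be applied to the full list of lines from page 2. Whether other lines on that page happen to match the pattern has not been verified here.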

This is great! We can import the data into Python directly from the PDF! Of course, it will be more usable in a data frame format so let’s use Pandas and re-structure it:

import pandas as pd

# Build the two-level column names once, outside the loop
columns = pd.MultiIndex.from_product(
    [['Wright', 'Mini Wright'], ['First', 'Second']]
)
subjects = []
readings = []
# Iterate over the data rows extracted from the PDF (lines 1 to 3 are headers)
for line in lines[4:21]:
    # Split on runs of white space into the subject number and four readings
    subject, *values = line.split()[:5]
    # Convert from text to integers so the values can be used in calculations
    subjects.append(int(subject))
    readings.append([int(value) for value in values])
# Construct the data frame in one step instead of concatenating row by row
df = pd.DataFrame(readings, index=subjects, columns=columns)
print(df)

##    Wright        Mini Wright       
##     First Second       First Second
## 1     494    490         512    525
## 2     395    397         430    415
## 3     516    512         520    508
## 4     434    401         428    444
## 5     476    470         500    500
## 6     557    611         600    625
## 7     413    415         364    460
## 8     442    431         380    390
## 9     650    638         658    642
## 10    433    429         445    432
## 11    417    420         432    420
## 12    656    633         626    605
## 13    267    275         260    227
## 14    478    492         477    467
## 15    178    165         259    268
## 16    423    372         350    370
## 17    427    421         451    443
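
Because the columns form a MultiIndex, a single column is selected with a tuple of both levels. The sketch below re-creates the first three subjects by hand, so that it runs without the PDF, and computes the difference between the two meters’ first readings. If the cells of your data frame hold strings rather than numbers, df.astype(int) converts them first:

```python
import pandas as pd

# First three subjects from the table above, typed in by hand so the
# example runs without the PDF
columns = pd.MultiIndex.from_product(
    [['Wright', 'Mini Wright'], ['First', 'Second']]
)
df = pd.DataFrame(
    [[494, 490, 512, 525], [395, 397, 430, 415], [516, 512, 520, 508]],
    index=[1, 2, 3],
    columns=columns,
)

# A tuple of (meter, reading) selects one column of the MultiIndex
wright_first = df[('Wright', 'First')]
mini_first = df[('Mini Wright', 'First')]
# Difference between the two meters on their first readings
difference = wright_first - mini_first
print(difference.tolist())  # prints: [-18, -35, -4]
```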

See here for how to continue the analysis of Bland and Altman’s data.

¹ Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327(8476):307–310. DOI: 10.1016/s0140-6736(86)90837-8. PMID: 2868172.

Utilities in Python:
Extract Text From PDFs
