PDF text extraction guide with Python

Oghenerukevwe Henrietta Kofi

Updated: August 12, 2024

Extracting text from PDF files is a common task in Python, whether you’re dealing with simple or encrypted documents. In this tutorial, you’ll explore how to extract text from PDFs using Python with the help of open source libraries like PyPDF and powerful APIs like PSPDFKit’s Python PDF API. Whether you’re a beginner or an experienced developer, this post will walk you through every step of the process.

It’s important to note that there are two different types of text extraction:

Extracting text that’s already selectable in a PDF viewer. PDFs are usually made up of text authored in a word processing program.
Extracting text from an image-based PDF document. PDFs are typically made up of images from documents that are scanned.

This post will focus on extracting text that’s already selectable.

The Benefits of Using Python for Text Extraction

Python is a versatile language with a rich ecosystem of libraries, making it an excellent choice for text extraction tasks. The flexibility of Python allows developers to choose from a variety of tools depending on the complexity of the task. For simple extraction tasks, lightweight libraries like PyPDF offer an easy-to-use interface, while more advanced needs can be met with robust APIs like PSPDFKit.

Python’s popularity in data processing and automation also means that integrating PDF text extraction into larger workflows is straightforward. Additionally, Python’s extensive documentation and community support make it easier to troubleshoot and extend functionality as needed.

Requirements

This tutorial will make use of Python version 3.12.3, but it should work with most 3.x Python versions. Create a new folder and a Python file to store all the code from this tutorial:

mkdir text_extract_pdf
cd text_extract_pdf
touch app.py

You’ll also need to install PyPDF(opens in a new tab). You’ll rely on this library to read a PDF file and extract data from it. It can easily be installed using PIP:

pip install pypdf

The tutorial will make use of two example PDF files to demonstrate the code, but you can use whichever PDF file you prefer while following along: file 1(opens in a new tab) and file 2(opens in a new tab). Just make sure to save the PDF file next to the app.py file and replace the file names in the rest of this tutorial appropriately.

Extracting Text from PDF Using PyPDF in Python

Open the app.py file and type the following code:

from pypdf import PdfReader

reader = PdfReader("compressed.tracemonkey-pldi-09.pdf")
for page in reader.pages:
    print(page.extract_text())

When you save and run the code, it’ll print all the text from the PDF file in the terminal. The code creates a PdfReader(opens in a new tab) object. Then it loops over all the pages in the PDF using the .pages(opens in a new tab) property and prints the text from each page using the .extract_text(opens in a new tab) method.

Skipping Headers and Footers with PyPDF

PyPDF allows you to use visitor functions that get called with each operator or text fragment. The visitor function receives five arguments: the text, the current transformation matrix, the text matrix, the font dictionary, and the font size. You can make use of the text matrix to figure out the x/y coordinates of the text fragment and decide if you want to skip it or extract it.

In the following example, PyPDF will skip the header and footer of this PDF document(opens in a new tab), as they fall outside of the acceptable y coordinate range:

from pypdf import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]

parts = []

def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 50 and y < 720:
        parts.append(text)

page.extract_text(visitor_text=visitor_body)
print("".join(parts))

Decrypting and Extracting Text from Encrypted PDFs in Python

The PDF files you’re working with may be encrypted. Luckily, you don’t have to look anywhere else for a solution, as PyPDF supports encryption and decryption of PDF files as well.

To work with encrypted documents, you’ll need to install the cryptography package:

pip install cryptography

Use the .decrypt method to decrypt a PDF file before extracting text from it:

from pypdf import PdfReader

reader = PdfReader("encrypted-pdf.pdf")

if reader.is_encrypted:
    reader.decrypt("password")

# extract text from all pages
for page in reader.pages:
    print(page.extract_text())

Using PSPDFKit API to Extract Text from PDF Files in Python

This section will cover how you can extract text with PSPDFKit API.

First, go to our website and create your free account. You’ll see the page below.

Free account PSPDFKit API

After you’ve verified your email, you’ll have access to your API key. Navigate to the Overview page to get started, or go to API Keys to retrieve your key.

Image showing navigation to API keys on PSPDFKit API’s dashboard

To work with PSPDFKit API, you’ll need to install the requests package:

pip install requests

After installing the package, you can create a Python script to perform text extraction using the API’s /build endpoint:

import json
import requests

file = "./example.pdf"

url = "https://api.pspdfkit.com/build"

payload= {
  "instructions": json.dumps({
    "parts": [
      {
        "file": "file"
      }
    ],
    "output": {
        "type": "json-content",
        "plainText": True,
        "structuredText": True,
    }
})}

files=[
  ('file',('file.pdf',open(file,'rb'),'application/pdf')),
]
headers = {
  'Authorization': 'Bearer <API-KEY>'
}

response = requests.post(url, headers = headers, data = payload, files = files)

if response.status_code == 200:
  print(response.content)
else:
  print(
    f"Request to PSPDFKit API failed with status code {response.status_code}: '{response.text}'."
  )

Be sure to replace <API-KEY> in the code above with your key from the PSPDFKit API dashboard. Also ensure that an actual PDF file is present at the path specified by the file variable on line 4.

You can perform many operations using PSPDFKit API, including text extraction, Office conversion, and OCR. Learn more by reading our documentation.

Comparing PyPDF and PSPDFKit for Text Extraction

When it comes to extracting text from PDF files, both PyPDF and PSPDFKit are powerful tools, but they serve different needs.

PyPDF

Open Source — PyPDF is an open source library, making it a cost-effective choice for developers working on projects with budget constraints.
Lightweight and Easy to Use — PyPDF is simple to integrate into Python projects and works well for basic text extraction tasks.
Community-Driven — As an open source project, PyPDF benefits from community contributions and updates, but it might lack the advanced features of commercial tools.

PSPDFKit

Advanced Features — PSPDFKit is a commercial API that offers advanced features like high-fidelity text extraction, handling of complex PDFs, and support for encrypted documents.
Security and Compliance — PSPDFKit provides SOC 2-compliant security, making it a suitable choice for enterprise applications where data security is a priority.
Comprehensive Support — With PSPDFKit, users benefit from professional support and regular updates, ensuring reliability and performance in production environments.

In summary, PyPDF is ideal for simpler, budget-conscious projects, while PSPDFKit is the go-to solution for enterprise-level applications requiring advanced capabilities and security.

Conclusion

This tutorial covered the basics of extracting text from a PDF file using Python and PyPDF. It also showed how to extract text from an encrypted PDF file.

The second part of the tutorial introduced PSPDFKit API as an alternative solution for extracting text from a PDF. Leveraging the power of PSPDFKit API, you can efficiently and easily extract meaningful text from PDF files while ensuring high extraction speed and quality.

FAQ

How can I extract text from a PDF using Python?

You can use libraries like PyPDF for basic text extraction and PSPDFKit for more advanced features, including handling encrypted PDFs.

How do I install PyPDF for text extraction?

Install PyPDF with the command pip install pypdf in your terminal or command prompt.

How can I decrypt an encrypted PDF in Python?

Use the decrypt method from PyPDF after installing the cryptography package with pip install cryptography.

What are the benefits of using PSPDFKit for text extraction?

PSPDFKit offers advanced text extraction features and high-quality results, and it supports encrypted PDFs, making it suitable for more complex tasks.

What should I check if my PDF extraction code isn’t working?

Ensure file paths are correct, verify your API key if using an external service, and check for any error messages in your console.