Extracting text from PDF files is a common task in Python, whether you’re dealing with simple or encrypted documents. In this tutorial, you’ll explore how to extract text from PDFs using Python with the help of open source libraries like PyPDF and powerful APIs like PSPDFKit’s Python PDF API. Whether you’re a beginner or an experienced developer, this post will walk you through every step of the process.
It’s important to note that there are two different types of text extraction:
-
Extracting text that’s already selectable in a PDF viewer. PDFs are usually made up of text authored in a word processing program.
-
Extracting text from an image-based PDF document. PDFs are typically made up of images from documents that are scanned.
This post will focus on extracting text that’s already selectable.
The Benefits of Using Python for Text Extraction
Python is a versatile language with a rich ecosystem of libraries, making it an excellent choice for text extraction tasks. The flexibility of Python allows developers to choose from a variety of tools depending on the complexity of the task. For simple extraction tasks, lightweight libraries like PyPDF offer an easy-to-use interface, while more advanced needs can be met with robust APIs like PSPDFKit.
Python’s popularity in data processing and automation also means that integrating PDF text extraction into larger workflows is straightforward. Additionally, Python’s extensive documentation and community support make it easier to troubleshoot and extend functionality as needed.
Requirements
This tutorial will make use of Python version 3.12.3, but it should work with most 3.x Python versions. Create a new folder and a Python file to store all the code from this tutorial:
mkdir text_extract_pdf cd text_extract_pdf touch app.py
You’ll also need to install PyPDF. You’ll rely on this library to read a PDF file and extract data from it. It can easily be installed using PIP:
pip install pypdf
The tutorial will make use of two example PDF files to demonstrate the code, but you can use whichever PDF file you prefer while following along: file 1 and file 2. Just make sure to save the PDF file next to the app.py
file and replace the file names in the rest of this tutorial appropriately.
Extracting Text from PDF Using PyPDF in Python
Open the app.py
file and type the following code:
from pypdf import PdfReader reader = PdfReader("compressed.tracemonkey-pldi-09.pdf") for page in reader.pages: print(page.extract_text())
When you save and run the code, it’ll print all the text from the PDF file in the terminal. The code creates a PdfReader
object. Then it loops over all the pages in the PDF using the .pages
property and prints the text from each page using the .extract_text
method.
Skipping Headers and Footers with PyPDF
PyPDF allows you to use visitor functions that get called with each operator or text fragment. The visitor function receives five arguments: the text, the current transformation matrix, the text matrix, the font dictionary, and the font size. You can make use of the text matrix to figure out the x/y coordinates of the text fragment and decide if you want to skip it or extract it.
In the following example, PyPDF will skip the header and footer of this PDF document, as they fall outside of the acceptable y coordinate range:
from pypdf import PdfReader reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf") page = reader.pages[3] parts = [] def visitor_body(text, cm, tm, fontDict, fontSize): y = tm[5] if y > 50 and y < 720: parts.append(text) page.extract_text(visitor_text=visitor_body) print("".join(parts))
Decrypting and Extracting Text from Encrypted PDFs in Python
The PDF files you’re working with may be encrypted. Luckily, you don’t have to look anywhere else for a solution, as PyPDF supports encryption and decryption of PDF files as well.
To work with encrypted documents, you’ll need to install the cryptography
package:
pip install cryptography
Use the .decrypt
method to decrypt a PDF file before extracting text from it:
from pypdf import PdfReader reader = PdfReader("encrypted-pdf.pdf") if reader.is_encrypted: reader.decrypt("password") # extract text from all pages for page in reader.pages: print(page.extract_text())
Using PSPDFKit API to Extract Text from PDF Files in Python
This section will cover how you can extract text with PSPDFKit API.
First, go to our website and create your free account. You’ll see the page below.
After you’ve verified your email, you’ll have access to your API key. Navigate to the Overview page to get started, or go to API Keys to retrieve your key.
To work with PSPDFKit API, you’ll need to install the requests
package:
pip install requests
After installing the package, you can create a Python script to perform text extraction using the API’s /build
endpoint:
import json import requests file = "./example.pdf" url = "https://api.pspdfkit.com/build" payload= { "instructions": json.dumps({ "parts": [ { "file": "file" } ], "output": { "type": "json-content", "plainText": True, "structuredText": True, } })} files=[ ('file',('file.pdf',open(file,'rb'),'application/pdf')), ] headers = { 'Authorization': 'Bearer <API-KEY>' } response = requests.post(url, headers = headers, data = payload, files = files) if response.status_code == 200: print(response.content) else: print( f"Request to PSPDFKit API failed with status code {response.status_code}: '{response.text}'." )
Be sure to replace <API-KEY>
in the code above with your key from the PSPDFKit API dashboard. Also ensure that an actual PDF file is present at the path specified by the file
variable on line 4.
You can perform many operations using PSPDFKit API, including text extraction, Office conversion, and OCR. Learn more by reading our documentation.
Comparing PyPDF and PSPDFKit for Text Extraction
When it comes to extracting text from PDF files, both PyPDF and PSPDFKit are powerful tools, but they serve different needs.
PyPDF
-
Open Source — PyPDF is an open source library, making it a cost-effective choice for developers working on projects with budget constraints.
-
Lightweight and Easy to Use — PyPDF is simple to integrate into Python projects and works well for basic text extraction tasks.
-
Community-Driven — As an open source project, PyPDF benefits from community contributions and updates, but it might lack the advanced features of commercial tools.
PSPDFKit
-
Advanced Features — PSPDFKit is a commercial API that offers advanced features like high-fidelity text extraction, handling of complex PDFs, and support for encrypted documents.
-
Security and Compliance — PSPDFKit provides SOC 2-compliant security, making it a suitable choice for enterprise applications where data security is a priority.
-
Comprehensive Support — With PSPDFKit, users benefit from professional support and regular updates, ensuring reliability and performance in production environments.
In summary, PyPDF is ideal for simpler, budget-conscious projects, while PSPDFKit is the go-to solution for enterprise-level applications requiring advanced capabilities and security.
Conclusion
This tutorial covered the basics of extracting text from a PDF file using Python and PyPDF. It also showed how to extract text from an encrypted PDF file.
The second part of the tutorial introduced PSPDFKit API as an alternative solution for extracting text from a PDF. Leveraging the power of PSPDFKit API, you can efficiently and easily extract meaningful text from PDF files while ensuring high extraction speed and quality.
FAQ
Here are a few frequently asked questions about extracting text from PDFs using Python.
How can I extract text from a PDF using Python?
You can use libraries likePyPDF
for basic text extraction and PSPDFKit
for more advanced features, including handling encrypted PDFs.
How do I install PyPDF for text extraction?
InstallPyPDF
with the command pip install pypdf
in your terminal or command prompt.
How can I decrypt an encrypted PDF in Python?
Use thedecrypt
method from PyPDF
after installing the cryptography
package with pip install cryptography
.
What are the benefits of using PSPDFKit for text extraction?
PSPDFKit
offers advanced text extraction features and high-quality results, and it supports encrypted PDFs, making it suitable for more complex tasks.
What should I check if my PDF extraction code isn’t working?
Ensure file paths are correct, verify your API key if using an external service, and check for any error messages in your console.Rukky joined Nutrient as an intern in 2022 and is currently a software engineer on the Server and Services Team. She’s passionate about building great software, and in her spare time, she enjoys reading cheesy novels, watching films, and playing video games.