Summarize a PDF document using machine learning and natural language processing
This article was first published in June 2022 and was updated in August 2024.
When it comes to retrieving and storing information from the internet, there’s an extensive choice of tools to help collect and compile it. The list ranges from the simplest things, like browser bookmarks and physical notes, to more complicated software such as mind maps and workspaces for personal knowledge management that use databases and Markdown pages.
We’re exposed to a constant flow of news, information, and messages from different sources, so it’s becoming more and more important to be able to quickly and efficiently organize and distill what’s useful for us.
The power of conciseness
Being able to condense a topic into a few pages (or lines) is considered a superpower for people who want to organize valuable information in a short amount of time. And up until a few years ago, the process of producing a summary from a given text was considered a task that could only be successfully completed by humans, and not by computers.
However, automatic summarization is now possible thanks to advances in machine learning and natural language processing. The next sections will provide a brief overview of these technologies before delving into how they can help us.
Rather than reinvent the wheel, these sections will include definitions compiled from other websites, as they do a better job than we can of describing these technologies.
Machine learning at your service
Machine learning (ML) is a field that’s considered part of artificial intelligence (AI). According to Wikipedia, “Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.”
Natural language processing, the technology you didn’t know
Natural language processing is another subfield of AI that, according to Wikipedia, is “concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.”
Natural language processing dates back to the 1950s, and some great minds, like Alan Turing, worked on automated interpretation and generation of natural language.
BART, the model for text comprehension tasks
No, not that Bart! The BART model was proposed for the first time on 29 October 2019 in the research paper entitled BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer.
According to Hugging Face, “BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder, like BERT, and a left-to-right decoder, like GPT.”
So what does that mean? Well, the paper’s abstract states that, “BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.”
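The "ROUGE" in that quote is a family of metrics that score a generated summary by how much of its wording overlaps with a human-written reference. As a rough illustration only, the sketch below computes a simplified ROUGE-1 (single-word overlap) score. The function name `rouge1` and the lowercase, whitespace-based tokenization are assumptions made for this example; published evaluations rely on dedicated packages with more careful preprocessing.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())

    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1(
    reference="the cat sat on the mat",
    candidate="the cat lay on the mat",
)
print(scores)
```

Here five of the six words overlap, so precision, recall, and F1 all come out at 5/6. ROUGE is usually reported on a 0 to 100 scale, so a gain of 6 ROUGE corresponds to roughly 0.06 on the 0 to 1 scale used in this sketch.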
Summarizing a PDF document the quickest possible way
OK, enough with the jargon and boring details. Data extraction from PDF documents is already a tricky topic for most. So what we want to know is: How hard would it be to generate a summary of a PDF document without too much hassle?
The answer is: It can be done using fewer than 20 lines of code in Python! With machine learning, NLP, and the BART model, we can summarize a PDF document written in English.
Getting started: Installation and setup
To dive into summarizing PDFs with machine learning, you’ll need a few key tools. Read on to get everything set up step by step!
- Install Python
First things first: make sure you have Python installed on your computer. Download it from python.org if you haven’t already.
- Set up a virtual environment
Virtual environments keep your project dependencies organized. Here’s how to create one:
python -m venv venv
Now, activate it.
- On Windows:
```bash
venv\Scripts\activate
```
- On macOS/Linux:
```bash
source venv/bin/activate
```
- Install the required libraries
Next, you’ll need a few libraries: pdfplumber for extracting text from PDFs, and transformers with torch for summarization. Install them with:
pip install pdfplumber transformers torch
- Verify the installation
Ensure everything is installed correctly by checking if you can import the libraries:
```python
import pdfplumber
from transformers import BartTokenizer, BartForConditionalGeneration
import torch
```
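If you'd rather get a report than an import error, the setup steps above can also be checked with the standard library alone. This is a minimal sketch, not part of the original script: the helper names `in_virtualenv` and `missing_packages` are made up for this example, and the check only confirms that each package can be found, not that it works.

```python
import importlib.util
import sys

REQUIRED_PACKAGES = ["pdfplumber", "transformers", "torch"]


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Virtual environment active:", in_virtualenv())
missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Because `find_spec` only locates a package without importing it, this runs quickly even though torch and transformers are large.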
Extract and summarize text from a PDF
Now that you’re set up, it’s time for the fun part: extracting text from a PDF and summarizing it with just a few lines of code.
Extracting text from a PDF
As a first step, you need to extract the text you want to process from a PDF document. For this task, you can use the Python library pdfplumber.
Save the following code into a Python file, for example summarize_pdf.py:
```python
import pdfplumber

# Open and extract text from the PDF.
with pdfplumber.open(r'document.pdf') as pdf:
    extracted_page = pdf.pages[1]  # Access the second page (index 1).
    extracted_text = extracted_page.extract_text()

print(extracted_text)
```
Make sure document.pdf is in the same directory as your script, or provide the full path.
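The snippet above reads a single page. If you want the text of the whole document instead, one possible approach is to loop over every page and join the results. The sketch below is only an illustration: `join_page_texts` and `extract_pdf_text` are hypothetical helper names, and it assumes that pages without a text layer (scanned images, for example) may come back as `None` or as an empty string.

```python
def join_page_texts(page_texts):
    """Join per-page text, skipping pages that produced no text (None or blank)."""
    cleaned = [text.strip() for text in page_texts if text and text.strip()]
    return "\n\n".join(cleaned)


def extract_pdf_text(path):
    """Extract text from every page of the PDF at `path` using pdfplumber."""
    import pdfplumber  # Imported here so join_page_texts works without it.

    with pdfplumber.open(path) as pdf:
        page_texts = [page.extract_text() for page in pdf.pages]
    return join_page_texts(page_texts)
```

You would then call `extracted_text = extract_pdf_text("document.pdf")` in place of the single-page extraction.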
Summarize the extracted text
Next, you can use the transformers library offered by Hugging Face, together with the BART tokenizer and a distilled BART model trained for text summarization. Add the code below to the same file; it takes the text stored in the extracted_text variable and generates a summary:
```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Load the pretrained BART model and tokenizer.
model = BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-cnn-12-6')
tokenizer = BartTokenizer.from_pretrained('sshleifer/distilbart-cnn-12-6')

# Tokenize the extracted text.
inputs = tokenizer([extracted_text], truncation=True, return_tensors='pt')

# Generate a summary.
summary_ids = model.generate(
    inputs['input_ids'],
    num_beams=4,
    early_stopping=True,
    min_length=0,
    max_length=1024,
)
summarized_text = [
    tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for g in summary_ids
]
print(summarized_text[0])
```
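One thing to keep in mind: with `truncation=True`, the tokenizer cuts the input down to the model's maximum input length, so anything beyond that limit never reaches the model. A common workaround is to split a long text into chunks and summarize each one. The sketch below shows that idea under stated assumptions: `chunk_words` and `summarize_long_text` are hypothetical helpers, the 400-word chunk size and the `max_length=150` cap are illustrative values rather than tuned settings, and counting words is only a rough stand-in for counting tokens.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def summarize_long_text(text, model_name="sshleifer/distilbart-cnn-12-6"):
    """Summarize each chunk separately and join the partial summaries."""
    # Imported here so chunk_words can be used without transformers installed.
    from transformers import BartForConditionalGeneration, BartTokenizer

    model = BartForConditionalGeneration.from_pretrained(model_name)
    tokenizer = BartTokenizer.from_pretrained(model_name)

    summaries = []
    for chunk in chunk_words(text):
        inputs = tokenizer([chunk], truncation=True, return_tensors="pt")
        summary_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            num_beams=4,
            early_stopping=True,
            max_length=150,  # Assumed cap on each partial summary, in tokens.
        )
        summaries.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
    return " ".join(summaries)
```

Summarizing chunks independently is simple, but the partial summaries may repeat each other or lose context that spans a chunk boundary, so treat the joined result as a draft.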
Running the script
Execute your Python script to see the magic happen:
python summarize_pdf.py
And voilà, you’ll see the extracted text from the PDF and a concise summary generated by the BART model. 😎
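The script above hardcodes the file name document.pdf and the page index 1. If you'd like to pass them on the command line instead, a small argument parser is one option. This is a sketch under assumptions, not part of the original script: `build_parser` and `page_index` are hypothetical helpers, and the `--page` option is invented for this example.

```python
import argparse


def build_parser():
    """Create a parser for a hypothetical command-line version of the script."""
    parser = argparse.ArgumentParser(description="Summarize one page of a PDF document.")
    parser.add_argument("pdf_path", help="Path to the PDF file to summarize.")
    parser.add_argument(
        "--page",
        type=int,
        default=2,
        help="1-based page number to summarize (default: 2).",
    )
    return parser


def page_index(page_number, page_count):
    """Convert a 1-based page number to a 0-based index, validating the range."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"Page {page_number} is out of range (1 to {page_count}).")
    return page_number - 1


# Parse an explicit argument list here; a real script would call parse_args() with no arguments.
args = build_parser().parse_args(["document.pdf", "--page", "2"])
print(args.pdf_path, page_index(args.page, page_count=5))
```

In the script itself, you would pass `args.pdf_path` to `pdfplumber.open` and use `page_index(args.page, len(pdf.pages))` in place of the fixed index.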
The full implementation can be found in this Colab file. Google Colab is a free Jupyter notebook environment that runs entirely in the cloud. Running all the steps from the beginning lets you load a document and check the actual output in real time.
Conclusion
The goal of this blog post was to introduce you to machine learning and natural language processing applied to PDF documents, showing the quickest possible way to summarize a text. This task requires a wide skillset, and depending on the type of language involved (e.g. something scientific, academic, or conversational), the type of model you need will vary.
If you feel you can do it in fewer than 20 lines of code, contact me on Twitter and I’ll be happy to look at your solution. 😀
FAQ
Here are a few frequently asked questions about summarizing PDFs.
What is PDF summarization?
PDF summarization is the process of condensing lengthy PDF documents into brief summaries using AI and natural language processing (NLP) models.
How does BART help in summarizing PDFs?
BART is a transformer-based NLP model that can generate concise summaries from text extracted from PDFs by interpreting and condensing the content effectively.
Which libraries are needed for PDF summarization in Python?
You’ll need pdfplumber for text extraction and transformers with torch for summarization using BART.
Is a virtual environment necessary?
Using a virtual environment is recommended to manage dependencies without conflicts, especially when installing multiple Python packages.
Can I run this summarization code on Google Colab?
Yes, Google Colab supports all required libraries and allows you to test the summarization process directly in the cloud.