Summarize a PDF Document Using Machine Learning and Natural Language Processing

This article was first published in June 2022 and was updated in August 2024.

When it comes to retrieving and storing information from the internet, there’s an extensive choice of tools to help collect and compile it. The list ranges from the simplest things, like browser bookmarks and physical notes, to more complicated software such as mind maps and workspaces for personal knowledge management that use databases and Markdown pages.

We’re exposed to a constant flow of news, information, and messages from different sources, so it’s becoming more and more important to be able to quickly and efficiently organize and distill what’s useful for us.

The Power of Conciseness

Being able to condense a topic into a few pages (or lines) is a superpower for anyone who wants to organize valuable information quickly. Until a few years ago, producing a summary of a given text was considered a task that only humans, not computers, could complete successfully.

However, this is now possible thanks to advances in machine learning and natural language processing. The next sections provide a brief overview of these technologies before delving into how they can help us.

Rather than reinvent the wheel, these sections will include definitions compiled from other websites, as they do a better job than we can of describing these technologies.

Machine Learning at Your Service

Machine learning (ML) is a field that’s considered part of artificial intelligence (AI). According to Wikipedia, “Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.”

Natural Language Processing, the Technology You Didn’t Know

Natural language processing is another subfield of AI that, according to Wikipedia, is “concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.”

Natural language processing dates back to the 1950s, and some great minds, like Alan Turing, worked on automated interpretation and generation of natural language.

BART, the Model for Text Comprehension Tasks

(Image: Bart Simpson)

No, not that Bart! The BART model was proposed for the first time on 29 October 2019 in the research paper entitled BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer.

According to Hugging Face, “BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder, like BERT, and a left-to-right decoder, like GPT.”

So what does that mean? Well, the paper’s abstract states that, “BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.”
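To make the "bidirectional encoder" and "left-to-right decoder" wording more concrete, here is a minimal sketch of the two attention patterns, written in plain Python. The function names are illustrative and not part of the transformers library; a real BART model applies masks like these inside its attention layers.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Print both patterns for a sequence of four tokens.
for name, mask in (("encoder", bidirectional_mask(4)), ("decoder", causal_mask(4))):
    print(name)
    for row in mask:
        print(row)
```

A 1 means "this token can look at that token". The encoder reads the whole input at once, while the decoder writes the summary one token at a time and can only look back at what it has already produced.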

Summarizing a PDF Document the Quickest Possible Way

OK, enough with the jargon and boring details. Data extraction from PDF documents is already a tricky topic for most. So what we want to know is: How hard would it be to generate a summary of a PDF document without too much hassle?

The answer is: It can be done in fewer than 20 lines of Python code! Using machine learning for NLP, and the BART model in particular, we can easily summarize a PDF document written in English.

Getting Started: Installation and Setup

To dive into summarizing PDFs with machine learning, you’ll need a few key tools. Read on to get everything set up step by step!

  1. Install Python

First things first: make sure you have Python installed on your computer. Download it from python.org if you haven’t already.

  2. Set Up a Virtual Environment

Virtual environments keep your project dependencies organized. Here’s how to create one:

```bash
python -m venv venv
```

Now, activate it.

- On Windows:

```bash
venv\Scripts\activate
```

- On macOS/Linux:

```bash
source venv/bin/activate
```

  3. Install the Required Libraries

Next, you’ll need a couple of libraries: pdfplumber for extracting text from PDFs, and transformers with torch for summarization. Install them with:

```bash
pip install pdfplumber transformers torch
```

  4. Verify the Installation

Ensure everything is installed correctly by checking if you can import the libraries:

```python
import pdfplumber
from transformers import BartTokenizer, BartForConditionalGeneration
import torch
```
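If one of those imports fails, the traceback only names the first missing package. As an optional alternative, the sketch below reports the status of all three at once without importing them. It uses the standard library's importlib.util.find_spec; the helper name check_packages is made up for this example.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report the status of the three libraries this post relies on.
status = check_packages(["pdfplumber", "transformers", "torch"])
for name, found in status.items():
    print(f"{name}: {'OK' if found else 'missing'}")
```

Any package reported as missing can be installed with pip inside the activated virtual environment.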

Extract and Summarize Text from a PDF

Now that you’re set up, it’s time for the fun part: extracting text from a PDF and summarizing it with just a few lines of code.

Extracting Text from a PDF

As a first step, you need to extract the text you want to process from a PDF document. For this task, you can use the Python library pdfplumber.

Save the following code into a Python file, for example summarize_pdf.py:

```python
import pdfplumber

# Open and extract text from the PDF.
with pdfplumber.open(r'document.pdf') as pdf:
    extracted_page = pdf.pages[1]  # Access the second page (index 1).
    # Fall back to an empty string if the page yields no text.
    extracted_text = extracted_page.extract_text() or ''
    print(extracted_text)
```

Make sure document.pdf is in the same directory as your script or provide the full path.
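The snippet above reads only the second page and fails on a single-page PDF. If you need the whole document, one possible approach is to loop over every page and join the results. This is a sketch rather than the method used in the rest of this post, and the helper names extract_all_text and extract_text_from_file are invented for illustration.

```python
def extract_all_text(pdf):
    """Join the text of every page of an opened PDF-like object.

    `pdf` only needs a `.pages` sequence whose items offer `extract_text()`,
    which is the shape of the object returned by pdfplumber.open().
    """
    parts = []
    for page in pdf.pages:
        text = page.extract_text()
        if text:  # Skip pages that yield None or an empty string.
            parts.append(text.strip())
    return "\n".join(parts)


def extract_text_from_file(path):
    """Open `path` with pdfplumber and return the text of all pages."""
    import pdfplumber  # Imported here so the helper above has no dependency.

    with pdfplumber.open(path) as pdf:
        return extract_all_text(pdf)
```

Calling extract_text_from_file('document.pdf') would then return one string for the whole file. Keep in mind that a longer text is more likely to exceed what the summarization model can read in one pass.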

Summarize the Extracted Text

Next, you can use the transformers library offered by Hugging Face and the BART tokenizer with a distilled BART model specifically trained for text summarization. The code below continues the same script and summarizes the text that the previous step assigned to the extracted_text variable:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Load the pretrained BART model and tokenizer.
model = BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-cnn-12-6')
tokenizer = BartTokenizer.from_pretrained('sshleifer/distilbart-cnn-12-6')

# Tokenize the extracted text; input beyond the model's limit is truncated.
inputs = tokenizer([extracted_text], truncation=True, return_tensors='pt')

# Generate a summary with beam search.
summary_ids = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    num_beams=4,
    early_stopping=True,
    min_length=0,
    max_length=1024,
)

# Decode the generated token IDs back into readable text.
summarized_text = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in summary_ids]

print(summarized_text[0])
```
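Because the tokenizer is called with truncation=True, any text beyond the model's input limit is silently dropped and never reaches the summary. For longer inputs, one possible workaround is to split the text into chunks, summarize each chunk, and join the results. The sketch below assumes a hypothetical summarize callable that wraps the tokenizer and model calls shown above. It uses a word count as a rough stand-in for the token budget, and the default of 400 words is an arbitrary choice, not a tuned value.

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def summarize_long_text(text, summarize, max_words=400):
    """Summarize each chunk with `summarize` and join the partial summaries."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(summarize(chunk) for chunk in chunks)
```

Splitting on word boundaries can cut a sentence in half, so the partial summaries may read a little unevenly. Splitting on paragraphs or sentences would be a natural refinement.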

Running the Script

Execute your Python script to see the magic happen:

```bash
python summarize_pdf.py
```

And voilà, you’ll see the extracted text from the PDF and a concise summary generated by the BART model. 😎

The full implementation can be found in this Colab file. Google Colab is a free Jupyter notebook environment that runs entirely in the cloud. Running all the steps from the beginning lets you load a document and check the actual output in real time.

Conclusion

The goal of this blog post was to introduce you to machine learning and natural language processing applied to PDF documents, showing the quickest possible way to summarize a text. This task requires a wide skillset, and depending on the type of language involved (e.g. something scientific, academic, or conversational), the type of model you need will vary.

If you feel you can do it in fewer than 20 lines of code, contact me on Twitter and I’ll be happy to look at your solution. 😀
