How to OCR PDF files on Linux using OCRmyPDF

Yasoob Khalid

Teja Tatimatla

Hulya Masharipov

Updated: September 11, 2024

Optical character recognition (OCR) is an essential technology for converting scanned documents, images, and PDFs into searchable and editable formats. This post will walk you through how to OCR PDF files on Linux using the open source tool OCRmyPDF, which is powered by Tesseract. It also discusses an alternative approach using Nutrient Document Engine. Both options provide powerful capabilities for extracting text and making PDFs searchable.

How to OCR a PDF on Linux using an open source library

This section explains how to OCR a PDF on Linux with an open source library.

Why not use Tesseract directly?

The open source library you’ll use is OCRmyPDF, which is a multi-platform tool for running OCR on PDF files. It’s a wrapper around Tesseract that does some preprocessing on PDF files before running OCR on them. This preprocessing includes deskewing, noise removal, and cleaning up files to ensure the OCR engine can read the text accurately. OCRmyPDF also does some post-processing to ensure the output is consistent and error-free. You can use Tesseract directly, but in doing so, you’ll miss out on these benefits provided by OCRmyPDF.

Key features of OCRmyPDF

Automatic OCR — Automatically adds OCR text layers to existing PDFs.
Text recognition — Utilizes Tesseract for high-quality OCR.
Multi-language support — Supports multiple languages, including English, French, German, Spanish, and more.
PDF/A conversion — Converts PDFs to the PDF/A format for long-term archiving.
Command-line interface — Provides a simple command-line interface for ease of use.

Installing OCRmyPDF

Install OCRmyPDF using the following command on Ubuntu- or Debian-based systems:

sudo apt-get install ocrmypdf

For Fedora, you can use the following command:

sudo dnf install ocrmypdf

Sometimes, the packaged version isn’t the latest one, so you can also install OCRmyPDF from PyPI using pip:

pip install --user ocrmypdf

Keep in mind that pip won’t install OCRmyPDF’s non-Python dependencies, so you need to install them separately. The requirements include:

Python 3.8 or newer
Ghostscript 9.50 or newer
Tesseract 4.1.1 or newer
jbig2enc 0.29 or newer
pngquant 2.5 or newer
unpaper 6.1
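
Before running a pip-installed OCRmyPDF, it can help to see which of these external tools are already on your PATH. The following is a minimal sketch: the function name missing_tools is made up for this example, and the executable names (gs for Ghostscript, jbig2 for jbig2enc) are the names these packages commonly install, which may differ on your distribution.

```shell
#!/bin/sh
# Print the name of every given command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names: gs (Ghostscript), tesseract, jbig2 (jbig2enc),
# pngquant, and unpaper. Anything printed here still needs to be installed.
missing_tools gs tesseract jbig2 pngquant unpaper
```

An empty result only means the tools were found. The sketch doesn't compare versions against the minimums listed above.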

Basic usage

To use OCRmyPDF, run the following command, replacing input.pdf with the path to the PDF file you want to OCR, and output.pdf with the path where you want to save the OCR’d PDF:

ocrmypdf input.pdf output.pdf

This will result in a PDF/A output file with an OCR layer. PDF/A is a subset of the PDF standard that prohibits features that aren’t suitable for long-term archiving. This includes JavaScript in PDFs, font linking, and encryption. You can ask OCRmyPDF to output a standard PDF via this command:

ocrmypdf --output-type pdf input.pdf output.pdf

You can even perform OCR only on certain pages:

ocrmypdf --pages 2,3,13-17 input.pdf output.pdf

OCR in a language other than English

By default, OCRmyPDF assumes a document is in English. If the document is in a different language, the OCR quality will suffer considerably. In such a case, you need to explicitly pass in the language, like so:

ocrmypdf -l rus russian_doc.pdf russian_doc_ocr.pdf

If the document is multilingual, you can pass in multiple languages:

ocrmypdf -l rus+eng russian_doc.pdf russian_doc_ocr.pdf

Tesseract (the OCR engine used by OCRmyPDF under the hood) supports quite a few different languages. You can take a look at the Tesseract documentation to determine if it supports your required language.

You might be required to install additional language packs before you can use them with OCRmyPDF. Follow these instructions to figure out how to do so.
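
To find out whether a language pack is already installed, you can compare the codes you plan to pass to -l against the output of tesseract --list-langs. The helper below is a sketch under that assumption, and missing_langs is a hypothetical name. It reads the installed list on standard input and prints any requested code that isn't in it.

```shell
#!/bin/sh
# Read a list of installed language codes (one per line) on stdin and print
# each code from a Tesseract-style spec such as "rus+eng" that is not listed.
# The body runs in a subshell so the IFS change does not leak to the caller.
missing_langs() (
    installed=$(cat)
    IFS=+
    for code in $1; do
        printf '%s\n' "$installed" | grep -qxF -- "$code" || printf '%s\n' "$code"
    done
)

# Usage:
#   tesseract --list-langs | missing_langs rus+eng
```

On Debian and Ubuntu, Tesseract language data is usually packaged as tesseract-ocr-<code>, for example tesseract-ocr-rus.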

Image processing

As mentioned earlier, OCRmyPDF can perform some image processing on each page of a PDF, if required. It supports multiple options for this purpose. According to the official documentation, there are five different options. We’ve included the text from the documentation in the list below:

--rotate-pages attempts to determine the correct orientation for each page and rotates the page if necessary.
--remove-background attempts to detect and remove a noisy background from grayscale or color images. Monochrome images are ignored. This should not be used on documents that contain color photos as it may remove them.
--deskew will correct pages that were scanned at a skewed angle by rotating them back into place.
--clean uses unpaper to clean up pages before OCR, but does not alter the final output. This makes it less likely that OCR will try to find text in background noise.
--clean-final uses unpaper to clean up pages before OCR and inserts the page into the final output. You will want to review each page to ensure that unpaper did not remove something important.

Regardless of the order in which you pass these options, OCRmyPDF will always apply them in this order:

rotate -> remove background -> deskew -> clean
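
As an illustration, several of these flags can be combined in a single call. The wrapper name ocr_scan is hypothetical, and the chosen flags are only one plausible combination for scanned paper documents.

```shell
#!/bin/sh
# Rotate, deskew, and clean each page before OCR.
# --clean only affects what the OCR engine sees; the output images are unchanged.
ocr_scan() {
    ocrmypdf --rotate-pages --deskew --clean "$1" "$2"
}

# Usage:
#   ocr_scan scan.pdf scan_ocr.pdf
```

Because OCRmyPDF applies the steps in a fixed order, reordering the flags in the wrapper wouldn't change the result.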

File optimization

By default, OCRmyPDF optimizes the output PDF for Fast Web View. This linearizes the PDF file and stores all references in the same order in which they’ll be viewed by the user, which slightly increases the file size. You can disable optimization by passing in --optimize 0 or -O0.

At the default optimization level, -O1, OCRmyPDF also does some lossless image optimization using the JBIG2 encoder. You can disable this optimization by passing in -O0, or you can enable more aggressive lossy optimization by passing in -O2 or -O3.
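
One way to choose a level is to run the same file through each one and compare the resulting sizes. This is a rough sketch rather than a benchmark: compare_optimize_levels is a made-up helper name, and it repeats the full OCR run four times, so it's best tried on a short document.

```shell
#!/bin/sh
# Run OCRmyPDF once per optimization level (0-3) and print each output size.
# For input "scan.pdf", the outputs are scan-O0.pdf through scan-O3.pdf.
compare_optimize_levels() {
    input=$1
    for level in 0 1 2 3; do
        output="${input%.pdf}-O${level}.pdf"
        ocrmypdf --optimize "$level" "$input" "$output" || return 1
        # tr strips the padding some wc implementations add around the count.
        printf 'O%s: %s bytes\n' "$level" "$(wc -c < "$output" | tr -d ' ')"
    done
}

# Usage:
#   compare_optimize_levels scan.pdf
```

The lossy levels change image data, so it's worth opening the -O2 and -O3 outputs and checking the pages visually before adopting either one.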

Batch processing PDF files

By default, OCRmyPDF uses all available cores while processing PDF files. You can limit this by using the -j or --jobs option, which caps the number of cores used at once:

ocrmypdf -j 4 input.pdf output.pdf

The authors of the program also conveniently created a watcher.py file for watching folders and performing OCR on any new PDF file. You might need to update the contents of the watcher file to suit your specific needs. Because this file has some additional dependencies, you might need to install ocrmypdf with the watcher extra:

pip install "ocrmypdf[watcher]"

You can then run the watcher like this:

env OCR_INPUT_DIRECTORY=./input-pdfs \
    OCR_OUTPUT_DIRECTORY=./output-pdfs \
    python3 watcher.py

This will OCR any new PDF files that are placed in the input-pdfs folder and place the resulting PDFs in the output-pdfs folder. Note that this won’t process any files that were already in the input-pdfs folder before the watcher was run.
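
For files that were already in the folder, a plain shell loop is one possible alternative to the watcher. The sketch below assumes a flat directory of .pdf files. The function name ocr_directory is hypothetical, and it skips any file that already has a same-named result in the output folder.

```shell
#!/bin/sh
# OCR every PDF in one directory into another, skipping files that already
# have a result, so the loop can be rerun safely after an interruption.
ocr_directory() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for pdf in "$in_dir"/*.pdf; do
        # If the glob matched nothing, the literal pattern comes through; skip it.
        [ -e "$pdf" ] || continue
        target="$out_dir/$(basename "$pdf")"
        if [ -e "$target" ]; then
            continue    # already processed on an earlier run
        fi
        ocrmypdf "$pdf" "$target" || printf 'failed: %s\n' "$pdf" >&2
    done
}

# Usage:
#   ocr_directory ./input-pdfs ./output-pdfs
```

Files are processed one at a time here, since OCRmyPDF already spreads the pages of each file across the available cores.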

How to OCR a PDF on Linux using Nutrient Document Engine

Nutrient Document Engine offers a powerful and scalable solution for performing OCR and managing document workflows. It’s PDF server software designed for processing documents and powering PDF automation workflows. Operating as a headless service, it can be deployed within your own infrastructure or hosted via Nutrient.

Key features of Nutrient Document Engine

HTTP-based API — Operates as a headless service for easy integration.
Flexible deployment — Deploy within your infrastructure or host via Nutrient.
Frontend SDKs — Works alongside Nutrient’s web and mobile frontend SDKs.
Prebuilt features — Includes the ability to annotate, edit, sign, form fill, redact, and more.

OCR capabilities with Document Engine

Document Engine includes custom-built OCR technology to accurately recognize text and patterns, generating searchable PDF/A files. OCR-processed PDFs can be opened in Nutrient’s Web, iOS, Android, React Native, and Flutter client SDKs.

Key features of Nutrient Document Engine for OCR

Highly accurate OCR — Document Engine includes a custom-built AI- and ML-powered OCR engine that delivers highly accurate text and pattern recognition. This enables you to convert images, scanned documents, and unstructured data into searchable and editable content.
Multi-language support — It supports multiple languages, including English, French, German, Spanish, and more, making it versatile for global applications.
Searchable PDF generation — Turn any scanned document or image into a searchable PDF or PDF/A document. This is ideal for archiving and indexing documents for quick retrieval.
Data extraction — The OCR engine can extract key-value pairs from unstructured documents, which can be particularly useful for automating workflows in industries like healthcare, finance, and legal.
Post-processing capabilities — After processing a document with OCR, you can add signatures and annotations, and even perform document assembly, enhancing your document management workflows.
Integrated viewing options — Document Engine integrates seamlessly with Nutrient’s Web, iOS, Android, React Native, and Flutter client SDKs, enabling you to open and display processed PDFs within your applications.

System requirements

To run Nutrient Document Engine, your system must meet the following criteria:

macOS — Ventura, Monterey, Mojave, Catalina, or Big Sur
Linux — Ubuntu, Fedora, Debian, CentOS, or derivatives like Kubuntu or Xubuntu; 64-bit Intel (x86_64) and ARM (AArch64) processors are supported.

You should have a minimum of 4 GB RAM available, regardless of the operating system.

Setting up Docker

Document Engine is provided as a Docker container. To deploy it, install Docker for your operating system:

macOS — Install and start Docker Desktop for Mac. Refer to the Docker website for instructions.
Windows — Install and start Docker Desktop for Windows. Refer to the Docker website for instructions.
Linux — Install Docker Engine. Refer to the Docker website for instructions.

Launching Document Engine

Once Docker is installed, follow the steps outlined below to start Document Engine.

Open your terminal:
- macOS — You can use a terminal integrated within your IDE or standalone applications like Terminal.app or iTerm2.
- Windows/Linux — Use any terminal emulator or the one provided in your IDE.
Enter the following command to start the service:

docker run --rm -t -p 5000:5000 -e API_AUTH_TOKEN=secret pspdfkit/document-engine:1.5.0

The initialization might take some time, depending on your network speed. Wait until you see a message like the following one:

[info]  2024-02-05 18:56:45.286  Running Document Engine version 1.5.0

Installing curl

To interact with Document Engine, you need to use its HTTP API by sending commands and documents in HTTP requests. For this, ensure you have curl installed:

macOS — curl is preinstalled, so no additional steps are required.
Windows — Download and install curl from the official site.
Linux — Use your package manager (e.g. sudo apt-get install curl for Debian/Ubuntu).

Performing OCR with Nutrient Document Engine

Once Document Engine is running, you can perform OCR on your PDFs by sending requests to its API.

Running OCR on document upload

To perform OCR when uploading a new document, use the ocr action within the instructions parameter in your API request:

curl -X POST http://localhost:5000/api/documents \
  -H "Authorization: Token token=<API token>" \
  -F instructions='{
    "parts": [
      {
        "file": "document"
      }
    ],
    "actions": [
      {
        "type": "ocr",
        "language": "english"
      }
    ]
  }' \
  -F document=@/path/to/ExampleDocument.pdf \
  -o result.pdf

The same request as raw HTTP:

POST /api/documents HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary
Authorization: Token token=<API token>

--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    {
      "file": "document"
    }
  ],
  "actions": [
    {
      "type": "ocr",
      "language": "english"
    }
  ]
}
--customboundary
Content-Disposition: form-data; name="document"; filename="Example Document.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary--

This command uploads ExampleDocument.pdf, applies OCR in English, and outputs a searchable PDF named result.pdf.

Applying OCR to existing documents

If you have a document already uploaded to Document Engine, you can apply OCR using the apply_instructions endpoint:

curl -X POST http://localhost:5000/api/documents/:document_id/apply_instructions \
  -H 'Authorization: Token token=<API token>' \
  -H "Content-Type: application/json" \
  -d '{
    "parts": [
      {
        "document": {
          "id": "#self"
        }
      }
    ],
    "actions": [
      {
        "type": "ocr",
        "language": "english"
      }
    ]
  }' \
  -o result.pdf

The same request as raw HTTP:

POST /api/documents/:document_id/apply_instructions HTTP/1.1
Content-Type: application/json
Authorization: Token token=<API token>

{
  "parts": [
    {
      "document": {
        "id": "#self"
      }
    }
  ],
  "actions": [
    {
      "type": "ocr",
      "language": "english"
    }
  ]
}

Replace :document_id with your document’s ID. The #self anchor is used to refer to the current document.

Running OCR and retrieving the result without storing

To perform OCR on a document and retrieve the result without storing it in Document Engine’s storage, use the /build endpoint:

curl -X POST http://localhost:5000/api/build \
  -H "Authorization: Token token=<API token>" \
  -F instructions='{
    "parts": [
      {
        "file": "document"
      }
    ],
    "actions": [
      {
        "type": "ocr",
        "language": "english"
      }
    ]
  }' \
  -F document=@/path/to/ExampleDocument.pdf \
  -o result.pdf

The same request as raw HTTP:

POST /api/build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary
Authorization: Token token=<API token>

--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    {
      "file": "document"
    }
  ],
  "actions": [
    {
      "type": "ocr",
      "language": "english"
    }
  ]
}
--customboundary
Content-Disposition: form-data; name="document"; filename="Example Document.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary--
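
If you call the /build endpoint often, the request can be wrapped in a small shell function. This is a sketch, not an official client. The function names and the DOCUMENT_ENGINE_URL variable are made up for this example, API_AUTH_TOKEN is assumed to hold the token the container was started with, and the "file" value in the instructions is assumed to name the multipart field that carries the PDF.

```shell
#!/bin/sh
# Print instructions that OCR the uploaded part named "document".
# The language is inserted as-is, so pass only plain names such as "english".
ocr_instructions() {
    printf '{"parts":[{"file":"document"}],"actions":[{"type":"ocr","language":"%s"}]}' "$1"
}

# POST a PDF to /api/build and save the response body.
#   $1 = input PDF, $2 = output path, $3 = language (defaults to english)
ocr_via_build() {
    curl -sS --fail -X POST "${DOCUMENT_ENGINE_URL:-http://localhost:5000}/api/build" \
        -H "Authorization: Token token=${API_AUTH_TOKEN}" \
        -F "document=@$1" \
        -F "instructions=$(ocr_instructions "${3:-english}")" \
        -o "$2"
}

# Usage:
#   API_AUTH_TOKEN=secret ocr_via_build scan.pdf scan_ocr.pdf english
```

The --fail flag makes curl exit with a nonzero status on an HTTP error, which is easier to detect in a script than an error body saved under a .pdf name.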

Performance considerations

Running OCR is a CPU-bound single-threaded operation. Performing many parallel OCR operations on a single Document Engine instance can cause a high load for extended periods. Some performance benchmarks on development hardware are as follows:

6-page document — ~35–40 seconds for the entire document, ~6–11 seconds per page.
1-page document — ~3–4 seconds per page.

Factors affecting performance include the number of pages, content complexity, and server hardware capabilities.
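
Since each OCR operation keeps a core busy, one way to avoid overloading a single instance is to cap how many requests the client sends at once. The sketch below assumes GNU find and xargs, which are standard on Linux. The run_throttled name is hypothetical, and the command you pass to it (a script of your own that sends one file to Document Engine) isn't shown here.

```shell
#!/bin/sh
# Run a command once per PDF in a directory, with at most N running at a time.
#   $1 = directory, $2 = maximum parallel runs, remaining args = command to run
# The file path is appended as the last argument of each invocation.
run_throttled() {
    in_dir=$1
    max_parallel=$2
    shift 2
    find "$in_dir" -maxdepth 1 -type f -name '*.pdf' -print0 |
        xargs -0 -r -n 1 -P "$max_parallel" "$@"
}

# Usage (ocr-one.sh is a hypothetical script that uploads a single file):
#   run_throttled ./input-pdfs 2 ./ocr-one.sh
```

A sensible starting value for the limit is the number of cores available to the Document Engine container.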

Conclusion

Both OCRmyPDF and Nutrient Document Engine offer robust OCR solutions for converting scanned documents into searchable PDFs. OCRmyPDF is a great choice for those who prefer an open source, command-line-based tool with simple setup and usage. In contrast, Nutrient Document Engine provides a more integrated, scalable, and feature-rich approach for enterprise applications.

For more information on setting up and using Nutrient Document Engine, visit the Nutrient documentation or reach out to our team to get more information.

FAQ

What is OCRmyPDF?

OCRmyPDF is an open source tool that adds OCR layers to PDF files, making them searchable and editable. It uses Tesseract for text recognition and performs preprocessing like deskewing and noise removal.

How do I install OCRmyPDF on Linux?

You can install OCRmyPDF on Ubuntu/Debian with sudo apt-get install ocrmypdf, on Fedora with sudo dnf install ocrmypdf, or via pip with pip install --user ocrmypdf.

How can I OCR a PDF in a language other than English?

You can specify the language using the -l flag. For example, to OCR in Russian:

ocrmypdf -l rus input.pdf output.pdf

What is Nutrient Document Engine and how is it different?

Nutrient Document Engine is an enterprise-grade, scalable solution for OCR and document management. It offers a more feature-rich OCR experience, including AI-powered text recognition, multi-language support, and integration with frontend SDKs for web and mobile.

How do I use Nutrient Document Engine for OCR?

You can deploy Document Engine using Docker and perform OCR via its HTTP API. Here’s an example using curl:

curl -X POST http://localhost:5000/api/documents \
  -H "Authorization: Token token=<API token>" \
  -F document=@/path/to/file.pdf \
  -F instructions='{"parts":[{"file":"document"}],"actions":[{"type":"ocr","language":"english"}]}' \
  -o result.pdf