How to OCR PDF files on Linux using OCRmyPDF
Optical character recognition (OCR) is an essential technology for converting scanned documents, images, and PDFs into searchable and editable formats. This post will walk you through how to OCR PDF files on Linux using the open source tool OCRmyPDF, which is powered by Tesseract. It also discusses an alternative approach using Nutrient Document Engine. Both options provide powerful capabilities for extracting text and making PDFs searchable.
How to OCR a PDF on Linux using an open source library
This section explains in detail how to OCR a PDF on Linux with an open source library.
Why not use Tesseract directly?

The open source library you’ll use is OCRmyPDF, which is a multi-platform tool for running OCR on PDF files. It’s a wrapper around Tesseract that does some preprocessing on PDF files before running OCR on them. This preprocessing includes deskewing, noise removal, and cleaning up files to ensure the OCR engine can read the text accurately. OCRmyPDF also does some post-processing to ensure the output is consistent and error-free. You can use Tesseract directly, but in doing so, you’ll miss out on these benefits provided by OCRmyPDF.
Key features of OCRmyPDF
- Automatic OCR — Automatically adds OCR text layers to existing PDFs.
- Text recognition — Utilizes Tesseract for high-quality OCR.
- Multi-language support — Supports multiple languages, including English, French, German, Spanish, and more.
- PDF/A conversion — Converts PDFs to the PDF/A format for long-term archiving.
- Command-line interface — Provides a simple command-line interface for ease of use.
Installing OCRmyPDF
Install OCRmyPDF using the following command on Ubuntu- or Debian-based systems:
sudo apt-get install ocrmypdf
For Fedora, you can use the following command:
sudo dnf install ocrmypdf
Sometimes, the available package version might not be the latest one, so you can install OCRmyPDF directly with pip too:
pip install --user ocrmypdf
Just keep in mind that pip won’t install Python itself or OCRmyPDF’s non-Python dependencies. The requirements include:
- Python 3.8 or newer
- Ghostscript 9.50 or newer
- Tesseract 4.1.1 or newer
- jbig2enc 0.29 or newer
- pngquant 2.5 or newer
- unpaper 6.1
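After a pip install, it can help to confirm that the external programs are actually on your PATH. The following is a minimal sketch rather than anything OCRmyPDF ships: missing_tools is a hypothetical helper name, and the binary names (gs for Ghostscript, jbig2 for jbig2enc) are assumptions that may differ on your distribution. It only checks that each program is present and doesn't verify versions.

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH.
# missing_tools is a hypothetical helper, not part of OCRmyPDF.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed binary names: gs = Ghostscript, jbig2 = jbig2enc.
missing_tools python3 gs tesseract jbig2 pngquant unpaper
```

If the command prints nothing, every listed program was found.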
Basic usage
To use OCRmyPDF, run the following command, replacing input.pdf with the path to the PDF file you want to OCR, and output.pdf with the path where you want to save the OCR’d PDF:
ocrmypdf input.pdf output.pdf
This will result in a PDF/A output file with an OCR layer. PDF/A is a subset of the PDF standard that prohibits features that aren’t suitable for long-term archiving. This includes JavaScript in PDFs, font linking, and encryption. You can ask OCRmyPDF to output a standard PDF via this command:
ocrmypdf --output-type pdf input.pdf output.pdf
You can even perform OCR only on certain pages:
ocrmypdf --pages 2,3,13-17 input.pdf output.pdf
OCR in a language other than English
By default, OCRmyPDF assumes a document is in English. If the document is in a different language, the OCR quality will be considerably worse. In such a case, you need to explicitly pass in the language, like so:
ocrmypdf -l rus russian_doc.pdf russian_doc_ocr.pdf
If the document is multilingual, you can pass in multiple languages:
ocrmypdf -l rus+eng russian_doc.pdf russian_doc_ocr.pdf
Tesseract (the OCR engine used by OCRmyPDF under the hood) supports quite a few different languages. You can take a look at the Tesseract documentation to determine if it supports your required language.
You might be required to install additional language packs before you can use them with OCRmyPDF. Follow the instructions in the OCRmyPDF documentation to figure out how to do so.
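Before running OCR in another language, you can check whether Tesseract already has the language data installed. This is a small sketch, assuming the tesseract binary is on your PATH; has_tesseract_lang is a hypothetical helper name. It relies on tesseract --list-langs, which prints a header line followed by one language code per line.

```shell
#!/bin/sh
# Succeed if Tesseract reports the given language code as installed.
# has_tesseract_lang is a hypothetical helper name.
has_tesseract_lang() {
    # -x matches whole lines only and -F treats the code as a literal string,
    # so "eng" does not match a longer code that merely contains it.
    # 2>&1 is used because some Tesseract versions print the list to stderr.
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}

# Example: warn before running OCR if the Russian data is missing.
# has_tesseract_lang rus || echo "Install the Russian language pack first" >&2
```

The helper takes a single code, so for a multilingual document such as rus+eng you would check each code separately.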
Image processing
As mentioned earlier, OCRmyPDF can perform some image processing on each page of a PDF, if required. It supports multiple options for this purpose. According to the official documentation, there are five different options. We’ve included the text from the documentation in the list below:
- --rotate-pages attempts to determine the correct orientation for each page and rotates the page if necessary.
- --remove-background attempts to detect and remove a noisy background from grayscale or color images. Monochrome images are ignored. This should not be used on documents that contain color photos as it may remove them.
- --deskew will correct pages that were scanned at a skewed angle by rotating them back into place.
- --clean uses unpaper to clean up pages before OCR, but does not alter the final output. This makes it less likely that OCR will try to find text in background noise.
- --clean-final uses unpaper to clean up pages before OCR and inserts the page into the final output. You will want to review each page to ensure that unpaper did not remove something important.
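These options can be combined in a single invocation. As a hedged example, the wrapper below enables rotation, deskewing, and cleaning for a scan; ocr_scan and the file names are hypothetical, and you would pick the options that suit your own documents.

```shell
#!/bin/sh
# Hypothetical wrapper: OCR a scan with rotation, deskewing, and cleaning enabled.
# Usage: ocr_scan INPUT.pdf OUTPUT.pdf
ocr_scan() {
    # --clean only affects the image handed to the OCR engine; the page
    # images in the output stay as scanned (see the option list above).
    ocrmypdf --rotate-pages --deskew --clean "$1" "$2"
}

# ocr_scan scan.pdf scan_ocr.pdf
```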
Regardless of the order in which you pass these options, OCRmyPDF will always apply them in this order:
rotate -> remove background -> deskew -> clean
File optimization
By default, OCRmyPDF optimizes the output PDF for Fast Web View. This linearizes the PDF file and stores all references in the PDF file in the same order in which they’ll be viewed by the user. This slightly increases the file size as well; however, you can disable optimization by passing in --optimize 0 or -O0.
At the default optimization level, -O1, OCRmyPDF also does some lossless image optimization using the JBIG2 encoder. You can disable this optimization by passing in -O0, or you can enable more aggressive lossy optimization by passing in -O2 or -O3.
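To see what each level does to your own files, you can run the same input through several levels and compare the output sizes. The following is a sketch under assumptions: ocr_size is a hypothetical helper name, and the file names in the usage comment are placeholders. Keep in mind that -O2 and -O3 are lossy, so also check the visual quality of the results.

```shell
#!/bin/sh
# Hypothetical helper: OCR the input at a given optimization level
# and print the size of the resulting file in bytes.
# Usage: ocr_size LEVEL INPUT.pdf OUTPUT.pdf
ocr_size() {
    ocrmypdf --optimize "$1" "$2" "$3" || return 1
    # Some wc implementations pad the count with spaces; strip them.
    wc -c < "$3" | tr -d ' '
}

# for level in 0 1 2 3; do
#     printf 'O%s: %s bytes\n' "$level" "$(ocr_size "$level" input.pdf "out_O$level.pdf")"
# done
```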
Batch processing PDF files
By default, OCRmyPDF uses all available cores while processing PDF files. You can limit this by using the -j or --jobs option. This limits the number of concurrent threads used:
ocrmypdf -j 4 input.pdf output.pdf
The authors of the program also conveniently created a watcher.py file for watching folders and performing OCR on any new PDF file. You might need to update the contents of the watcher file to suit your specific needs. Because this file has some additional dependencies, you might need to install ocrmypdf with the watcher extra:
pip install "ocrmypdf[watcher]"
You can then run the watcher like this:
env OCR_INPUT_DIRECTORY=./input-pdfs \
    OCR_OUTPUT_DIRECTORY=./output-pdfs \
    python3 watcher.py
This will OCR any new PDF files that are placed in the input-pdfs folder and place the resulting PDFs in the output-pdfs folder. Note that this won’t process any files that were already in the input-pdfs folder before the watcher was run.
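For files that were already in the folder before the watcher started, a plain shell loop is one way to catch up. This is a minimal sketch and not part of OCRmyPDF: ocr_existing is a hypothetical helper name, and the directory names in the usage comment simply reuse the ones from the watcher example.

```shell
#!/bin/sh
# Hypothetical helper: OCR every PDF already present in IN_DIR into OUT_DIR.
# Usage: ocr_existing IN_DIR OUT_DIR
ocr_existing() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for pdf in "$in_dir"/*.pdf; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$pdf" ] || continue
        # Keep going if one file fails, but report it on stderr.
        ocrmypdf "$pdf" "$out_dir/$(basename "$pdf")" ||
            printf 'OCR failed for %s\n' "$pdf" >&2
    done
}

# ocr_existing ./input-pdfs ./output-pdfs
```

The loop processes one file at a time and leaves parallelism to OCRmyPDF itself, which by default already uses all available cores for each file.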
How to OCR a PDF on Linux using Nutrient Document Engine
Nutrient Document Engine offers a powerful and scalable solution for performing OCR and managing document workflows. It’s PDF server software designed for processing documents and powering PDF automation workflows. Operating as a headless service, it can be deployed within your own infrastructure or hosted via Nutrient.
Key features of Nutrient Document Engine
- HTTP-based API — Operates as a headless service for easy integration.
- Flexible deployment — Deploy within your infrastructure or host via Nutrient.
- Frontend SDKs — Works alongside Nutrient’s web and mobile frontend SDKs.
- Prebuilt features — Includes the ability to annotate, edit, sign, form fill, redact, and more.
OCR capabilities with Document Engine
Document Engine includes custom-built OCR technology to accurately recognize text and patterns, generating searchable PDF/A files. OCR-processed PDFs can be opened in Nutrient’s Web, iOS, Android, React Native, and Flutter client SDKs.
Key features of Nutrient Document Engine for OCR
- Highly accurate OCR — Document Engine includes a custom-built AI- and ML-powered OCR engine that delivers highly accurate text and pattern recognition. This enables you to convert images, scanned documents, and unstructured data into searchable and editable content.
- Multi-language support — It supports multiple languages, including English, French, German, Spanish, and more, making it versatile for global applications.
- Searchable PDF generation — Turn any scanned document or image into a searchable PDF or PDF/A document. This is ideal for archiving and indexing documents for quick retrieval.
- Data extraction — The OCR engine can extract key-value pairs from unstructured documents, which can be particularly useful for automating workflows in industries like healthcare, finance, and legal.
- Post-processing capabilities — After processing a document with OCR, you can add signatures and annotations, and even perform document assembly, enhancing your document management workflows.
- Integrated viewing options — Document Engine integrates seamlessly with Nutrient’s Web, iOS, Android, React Native, and Flutter client SDKs, enabling you to open and display processed PDFs within your applications.
System requirements
To run Nutrient Document Engine, your system must meet the following criteria:
- macOS — Ventura, Monterey, Mojave, Catalina, or Big Sur
- Linux — Ubuntu, Fedora, Debian, CentOS, or derivatives like Kubuntu or Xubuntu; 64-bit Intel (x86_64) and ARM (AArch64) processors are supported.
You should have a minimum of 4 GB RAM available, regardless of the operating system.
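On a Linux host, you can check the architecture and memory requirements before pulling the image. The helpers below are a hedged sketch: arch_supported and ram_sufficient are hypothetical names, the accepted architecture strings are the values uname -m is assumed to report for the supported processors, and 4 GB is treated as 4 × 1024 × 1024 kB.

```shell
#!/bin/sh
# Hypothetical preflight helpers for a Linux host; not part of Document Engine.

# Succeed if the machine architecture string is a supported 64-bit type.
arch_supported() {
    case "$1" in
        x86_64 | aarch64 | arm64) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if the given amount of memory (in kB) is at least 4 GB.
ram_sufficient() {
    [ "$1" -ge $((4 * 1024 * 1024)) ]
}

# Usage on Linux, where /proc/meminfo reports MemAvailable in kB:
# arch_supported "$(uname -m)" || echo "Unsupported CPU architecture" >&2
# ram_sufficient "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)" ||
#     echo "Less than 4 GB of RAM available" >&2
```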
Setting up Docker
Document Engine is provided as a Docker container. To deploy it, install Docker for your operating system:
- macOS — Install and start Docker Desktop for Mac. Refer to the Docker website for instructions.
- Windows — Install and start Docker Desktop for Windows. Refer to the Docker website for instructions.
- Linux — Install Docker Engine. Refer to the Docker website for instructions.
Launching Document Engine
Once Docker is installed, follow the steps outlined below to start Document Engine.
- Open your terminal:
  - macOS — You can use a terminal integrated within your IDE or standalone applications like Terminal.app or iTerm2.
  - Windows/Linux — Use any terminal emulator or the one provided in your IDE.
- Enter the following command to start the service:
docker run --rm -t -p 5000:5000 -e API_AUTH_TOKEN=secret pspdfkit/document-engine:1.5.0
The initialization might take some time, depending on your network speed. Wait until you see a message like the following one:
[info] 2024-02-05 18:56:45.286 Running Document Engine version 1.5.0
Installing curl
To interact with Document Engine, you need to use its HTTP API by sending commands and documents in HTTP requests. For this, ensure you have curl installed:
- macOS — curl is preinstalled, so no additional steps are required.
- Windows — Download and install curl from the official site.
- Linux — Use your package manager (e.g. sudo apt-get install curl for Debian/Ubuntu).
Performing OCR with Nutrient Document Engine
Once Document Engine is running, you can perform OCR on your PDFs by sending requests to its API.
Running OCR on document upload
To perform OCR when uploading a new document, use the ocr action within the instructions parameter in your API request:
curl -X POST http://localhost:5000/api/documents \
  -H "Authorization: Token token=<API token>" \
  -F instructions='{
    "parts": [{"file": "document"}],
    "actions": [{"type": "ocr", "language": "english"}]
  }' \
  -F document=@/path/to/ExampleDocument.pdf \
  -o result.pdf
The equivalent raw HTTP request looks like this:
POST /api/documents HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary
Authorization: Token token=<API token>

--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [{"file": "document"}],
  "actions": [{"type": "ocr", "language": "english"}]
}
--customboundary
Content-Disposition: form-data; name="document"; filename="ExampleDocument.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary--
This command uploads ExampleDocument.pdf, applies OCR in English, and outputs a searchable PDF named result.pdf. The file value in parts refers to the multipart form field that carries the PDF, so the two names must match.
Applying OCR to existing documents
If you have a document already uploaded to Document Engine, you can apply OCR using the apply_instructions endpoint:
curl -X POST http://localhost:5000/api/documents/:document_id/apply_instructions \
  -H "Authorization: Token token=<API token>" \
  -H "Content-Type: application/json" \
  -d '{
    "parts": [{"document": {"id": "#self"}}],
    "actions": [{"type": "ocr", "language": "english"}]
  }' \
  -o result.pdf
The equivalent raw HTTP request looks like this:
POST /api/documents/:document_id/apply_instructions HTTP/1.1
Content-Type: application/json
Authorization: Token token=<API token>

{
  "parts": [{"document": {"id": "#self"}}],
  "actions": [{"type": "ocr", "language": "english"}]
}
Replace :document_id with your document’s ID. The #self anchor is used to refer to the current document.
Running OCR and retrieving the result without storing
To perform OCR on a document and retrieve the result without storing it in Document Engine’s storage, use the /build endpoint:
curl -X POST http://localhost:5000/api/build \
  -H "Authorization: Token token=<API token>" \
  -F instructions='{
    "parts": [{"file": "document"}],
    "actions": [{"type": "ocr", "language": "english"}]
  }' \
  -F document=@/path/to/ExampleDocument.pdf \
  -o result.pdf
The equivalent raw HTTP request looks like this:
POST /api/build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary
Authorization: Token token=<API token>

--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [{"file": "document"}],
  "actions": [{"type": "ocr", "language": "english"}]
}
--customboundary
Content-Disposition: form-data; name="document"; filename="ExampleDocument.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary--
Performance considerations
Running OCR is a CPU-bound single-threaded operation. Performing many parallel OCR operations on a single Document Engine instance can cause a high load for extended periods. Some performance benchmarks on development hardware are as follows:
- 6-page document — ~35–40 seconds for the entire document, ~6–11 seconds per page.
- 1-page document — ~3–4 seconds per page.
Factors affecting performance include the number of pages, content complexity, and server hardware capabilities.
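Because each OCR job keeps a CPU core busy, one simple way to avoid overloading a single instance is to send documents one at a time. The loop below is a minimal sketch of that idea built on the /api/build request shown earlier. It makes several assumptions: build_ocr_all is a hypothetical helper name, DE_URL and DE_TOKEN are made-up environment variables (defaulting to the address and example token from the docker run command), and the "document" part name is assumed to have to match the multipart field name.

```shell
#!/bin/sh
# Hypothetical helper: OCR PDFs through /api/build strictly one at a time,
# so this script never has more than one OCR request in flight.
# Usage: build_ocr_all OUT_DIR FILE.pdf...
build_ocr_all() {
    out_dir=$1
    shift
    mkdir -p "$out_dir"
    for pdf in "$@"; do
        # --fail makes curl return non-zero on HTTP error responses.
        # DE_URL and DE_TOKEN are assumed variables, not Document Engine settings.
        curl --fail --silent --show-error -X POST \
            "${DE_URL:-http://localhost:5000}/api/build" \
            -H "Authorization: Token token=${DE_TOKEN:-secret}" \
            -F instructions='{"parts": [{"file": "document"}], "actions": [{"type": "ocr", "language": "english"}]}' \
            -F document=@"$pdf" \
            -o "$out_dir/$(basename "$pdf")" ||
            printf 'OCR request failed for %s\n' "$pdf" >&2
    done
}

# build_ocr_all ./output-pdfs ./input-pdfs/*.pdf
```

If you run several Document Engine instances, you could raise the concurrency accordingly, but a sequential loop is the conservative starting point for a single instance.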
Conclusion
Both OCRmyPDF and Nutrient Document Engine offer robust OCR solutions for converting scanned documents into searchable PDFs. OCRmyPDF is a great choice for those who prefer an open source, command-line-based tool with simple setup and usage. In contrast, Nutrient Document Engine provides a more integrated, scalable, and feature-rich approach for enterprise applications.
For more information on setting up and using Nutrient Document Engine, visit the Nutrient documentation or reach out to our team.
FAQ
OCRmyPDF is an open source tool that adds OCR layers to PDF files, making them searchable and editable. It uses Tesseract for text recognition and performs preprocessing like deskewing and noise removal.
You can install OCRmyPDF on Ubuntu/Debian with sudo apt-get install ocrmypdf, on Fedora with sudo dnf install ocrmypdf, or via pip with pip install --user ocrmypdf.
You can specify the language using the -l flag. For example, to OCR in Russian:
ocrmypdf -l rus input.pdf output.pdf
Nutrient Document Engine is an enterprise-grade, scalable solution for OCR and document management. It offers a more feature-rich OCR experience, including AI-powered text recognition, multi-language support, and integration with frontend SDKs for web and mobile.
You can deploy Document Engine using Docker and perform OCR via its HTTP API. Here’s an example using curl:
curl -X POST http://localhost:5000/api/documents \
  -H "Authorization: Token token=<API token>" \
  -F document=@/path/to/file.pdf \
  -F instructions='{"actions":[{"type":"ocr","language":"english"}]}' \
  -o result.pdf