Perform OCR on PDFs in Java

This guide explains how to use optical character recognition (OCR) functionality with Nutrient Java SDK to recognize and extract text from scanned PDFs and images. Before starting, ensure you’ve followed the steps in the OCR integration guide to enable OCR in Nutrient Java SDK. For more details on OCR technology and its applications, check out our comprehensive OCR overview guide.

Perform OCR on a PDF

Using OCR with Nutrient Java SDK is straightforward. The OCR functionality is handled via the OcrProcessor:

// Specify the file to perform OCR on, and open it.
File file = new File("ocrExample.pdf");
PdfDocument pdfDocument = PdfDocument.open(new FileDataProvider(file));

// Create a file for the output document, which will contain the text recognized in the source document.
File outputFile = File.createTempFile("ocrOutput", ".pdf");
FileDataProvider outputDataProvider = new FileDataProvider(outputFile);

// Create the processor with the open document using the processor `Builder` and perform OCR.
OcrProcessor ocrProcessor = new OcrProcessor.Builder(pdfDocument).build();
ocrProcessor.performOcr(outputDataProvider);

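The code above processes the ocrExample.pdf file and saves the result to a temporary file. Because File.createTempFile generates the name, the output file is named ocrOutput followed by a generated string and the .pdf extension, and it's created in the system's default temporary directory. The resulting PDF makes text that was previously image-only, such as in scanned documents, searchable and annotatable.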
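Since File.createTempFile picks the final file name, it can help to see how the output location is chosen. The following is a minimal sketch using only the Java standard library. OcrOutputFiles and its methods are hypothetical helper names for illustration and aren't part of Nutrient Java SDK:

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of Nutrient Java SDK.
final class OcrOutputFiles {
    private OcrOutputFiles() {}

    // Creates an empty temporary file. The JVM inserts a generated string
    // between the "ocrOutput" prefix and the ".pdf" suffix, so the exact
    // name differs on every call.
    static File temporary() {
        try {
            File file = File.createTempFile("ocrOutput", ".pdf");
            file.deleteOnExit(); // Remove the file when the JVM exits normally.
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a path with a fixed, predictable name inside the given directory.
    // This only constructs the path; it doesn't create the file on disk.
    static File fixedName(File directory) {
        return new File(directory, "ocrOutput.pdf");
    }
}
```

Either File can be wrapped in a FileDataProvider as shown earlier. Whether FileDataProvider requires the file to exist before writing isn't covered in this guide, so treat the fixed-name variant as an assumption to verify.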

Limit to a page range

If you know which pages need text recognition (for example, only the scanned pages of a mixed document), you can reduce processing time by limiting OCR to those pages:

// Create the processor with the open document and perform OCR only on page `0`.
Set<Integer> pages = new HashSet<>();
pages.add(0);
OcrProcessor ocrProcessor = new OcrProcessor.Builder(pdfDocument)
        .setPages(pages)
        .build();
ocrProcessor.performOcr(outputDataProvider);

The code above restricts OCR to the first page (page index 0, as pages are zero-based). Because only the selected pages are processed, this approach can significantly reduce processing time for large documents.
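Page numbers shown in PDF viewers usually start at 1, while the page indices used here start at 0. As a minimal sketch, the helper below converts an inclusive, one-based page range into a set of zero-based indices. PageRanges and zeroBased are hypothetical names for illustration and aren't part of Nutrient Java SDK:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Nutrient Java SDK.
final class PageRanges {
    private PageRanges() {}

    // Converts an inclusive, one-based page range (as shown in most viewers)
    // into zero-based page indices. For example, pages 1 to 3 become 0, 1, 2.
    static Set<Integer> zeroBased(int firstPage, int lastPage) {
        if (firstPage < 1 || lastPage < firstPage) {
            throw new IllegalArgumentException(
                    "Invalid page range: " + firstPage + " to " + lastPage);
        }
        Set<Integer> pages = new TreeSet<>();
        for (int page = firstPage; page <= lastPage; page++) {
            pages.add(page - 1); // Shift from one-based to zero-based.
        }
        return pages;
    }
}
```

The returned set has the same type as the pages variable above, so it could be passed to setPages, for instance as setPages(PageRanges.zeroBased(1, 3)) to process the first three pages.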

Language selection

For accurate OCR results, specify the language of your document when building the OcrProcessor. By default, Nutrient Java SDK uses English, but you can change the OCR language as needed:

// Create the processor with the open document and perform OCR using the Finnish language.
OcrProcessor ocrProcessor = new OcrProcessor.Builder(pdfDocument)
        .setLanguage(OcrLanguage.Finnish)
        .build();
ocrProcessor.performOcr(outputDataProvider);

Refer to the OCR language support guide for a complete list of supported languages. Matching the OCR language to the language of the document improves recognition accuracy for non-English text.