OCR PDF in C#

This guide demonstrates how to convert a PDF file into a searchable PDF using the OCR library in Nutrient .NET SDK for C#. The optical character recognition (OCR) engine recognizes the text on each page of the source PDF, and the result is saved as a new PDF in which users can search for text and copy and paste it. This makes scanned or image-only documents more accessible in .NET applications.

To convert a PDF file into a searchable PDF using Nutrient .NET SDK’s OCR library, follow these steps:

1. Create a GdPicturePDF object.
2. Load the source document by passing its path to the LoadFromFile method of the GdPicturePDF object.
3. Determine the number of pages with the GetPageCount method of the GdPicturePDF object.
4. Iterate through the pages of the source document.
5. For each page, select the page with the SelectPage method, and then run the OCR process with the OcrPage method of the GdPicturePDF object. Configure the OCR process by passing the following parameters to the OcrPage method, in this order:
   a. Language settings: Set the code of the language that Nutrient .NET SDK uses to recognize text in the source document. To specify several languages, separate the language codes with the + character. For example, eng+fra.
   b. OCR resource folder path: Set the path to the OCR resource folder. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.
   c. Character allowlist: Set the character allowlist. When scanning the document, the OCR engine only recognizes the characters included in the allowlist. When you pass an empty string (""), all characters are recognized.
   d. DPI resolution: Set the dots-per-inch (DPI) resolution the OCR engine uses. It’s recommended to use 300 for the best combination of speed and accuracy.
6. Save the result in a new PDF document with the SaveToFile method of the GdPicturePDF object.
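
As a small illustration of the language and allowlist parameters described above, the sketch below builds the strings that are passed to the OcrPage method. The OcrArguments class, its JoinLanguages method, and the two constants are hypothetical helpers written for this guide; they aren't part of Nutrient .NET SDK.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of Nutrient .NET SDK.
public static class OcrArguments
{
    // An empty allowlist means that all characters are recognized.
    public const string AllCharacters = "";

    // An allowlist that restricts recognition to digits.
    public const string DigitsOnly = "0123456789";

    // Joins language codes with the + character, for example "eng+fra".
    public static string JoinLanguages(params string[] codes)
    {
        string[] cleaned = codes
            .Where(code => !string.IsNullOrWhiteSpace(code))
            .Select(code => code.Trim())
            .Distinct()
            .ToArray();

        if (cleaned.Length == 0)
        {
            throw new ArgumentException("At least one language code is required.", nameof(codes));
        }

        return string.Join("+", cleaned);
    }
}
```

For example, OcrArguments.JoinLanguages("eng", "fra") returns "eng+fra", which can be passed as the first argument of the OcrPage method, and OcrArguments.DigitsOnly can be passed as the third argument when only digits should be recognized.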

This approach lets you add OCR-based conversion of PDFs to searchable PDFs to .NET applications, so that the text in scanned documents can be searched, selected, and copied.

The examples below convert a PDF file to a searchable PDF, first in C# and then in VB.NET:

using GdPicture14;

using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Determine the number of pages.
int pageCount = gdpicturePDF.GetPageCount();
// Loop through the pages of the source document.
for (int i = 1; i <= pageCount; i++)
{
    // Select a page and run the OCR process on it.
    gdpicturePDF.SelectPage(i);
    gdpicturePDF.OcrPage("eng", @"C:\GdPicture.NET 14\Redist\OCR", "", 300);
}
// Save the result in a new PDF document.
gdpicturePDF.SaveToFile(@"C:\temp\output.pdf");
gdpicturePDF.CloseDocument();

Imports GdPicture14

Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Determine the number of pages.
    Dim pageCount As Integer = gdpicturePDF.GetPageCount()
    ' Loop through the pages of the source document.
    For i = 1 To pageCount
        ' Select a page and run the OCR process on it.
        gdpicturePDF.SelectPage(i)
        gdpicturePDF.OcrPage("eng", "C:\GdPicture.NET 14\Redist\OCR", "", 300)
    Next
    ' Save the result in a new PDF document.
    gdpicturePDF.SaveToFile("C:\temp\output.pdf")
    gdpicturePDF.CloseDocument()
End Using
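
The samples above ignore the values returned by the SDK methods, so a failed load or a missing OCR resource folder goes unnoticed. The C# variant below is a minimal sketch of how the same steps might look with status checks. It assumes that the LoadFromFile, OcrPage, and SaveToFile methods each return a GdPictureStatus value and that GdPictureStatus.OK indicates success; verify this against the API reference for your SDK version. The file paths are the same placeholders used above.

```csharp
using System;
using GdPicture14;

using GdPicturePDF gdpicturePDF = new GdPicturePDF();

// Assumption: LoadFromFile returns a GdPictureStatus value.
GdPictureStatus status = gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
if (status != GdPictureStatus.OK)
{
    Console.WriteLine($"Loading the source document failed: {status}");
    return;
}

int pageCount = gdpicturePDF.GetPageCount();
for (int i = 1; i <= pageCount; i++)
{
    // Select a page and run the OCR process on it.
    gdpicturePDF.SelectPage(i);
    // Assumption: OcrPage returns a GdPictureStatus value.
    status = gdpicturePDF.OcrPage("eng", @"C:\GdPicture.NET 14\Redist\OCR", "", 300);
    if (status != GdPictureStatus.OK)
    {
        // Stop at the first failing page instead of saving a partial result.
        Console.WriteLine($"OCR failed on page {i}: {status}");
        return;
    }
}

// Assumption: SaveToFile returns a GdPictureStatus value.
status = gdpicturePDF.SaveToFile(@"C:\temp\output.pdf");
Console.WriteLine(status == GdPictureStatus.OK
    ? "Saved the searchable PDF."
    : $"Saving the result failed: {status}");
gdpicturePDF.CloseDocument();
```

Stopping at the first failure is only one possible policy. An application that prefers a partial result could instead log the failing page and continue with the next one.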
