OCR PDF in C#
This guide explains how to convert a PDF file to a searchable PDF. GdPicture.NET’s optical character recognition (OCR) engine recognizes the text in the document and saves the result in a separate PDF in which you can search, copy, and paste the text.
To convert a PDF file to a searchable PDF, follow the steps outlined below.
- Create a GdPicturePDF object.
- Load the source document by passing its path to the LoadFromFile method of the GdPicturePDF object.
- Determine the number of pages with the GetPageCount method of the GdPicturePDF object.
- Loop through the pages of the source document.
- For each page, select the page with the SelectPage method and run the OCR process with the OcrPage method of the GdPicturePDF object. Configure the OCR process by passing the following parameters to the OcrPage method:
  - Set the code of the language that GdPicture.NET uses to recognize text in the source document. To specify several languages, separate the language codes with the + character, for example, eng+fra.
  - Set the path to the OCR resource folder. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.
  - Set the character allowlist. When scanning the document, the OCR engine only recognizes the characters included in the allowlist. When you set "", all characters are recognized.
  - Set the dots-per-inch (DPI) resolution the OCR engine uses. It’s recommended to use 300 for the best combination of speed and accuracy.
- Save the result in a new PDF document.
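When a document mixes languages, the language parameter can be assembled from a list of codes instead of being typed by hand. The snippet below is a minimal sketch of that idea. The OcrLanguages class and its Combine method are hypothetical helpers written for this illustration and are not part of GdPicture.NET.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of GdPicture.NET.
static class OcrLanguages
{
    // Joins language codes with '+', the separator described in the steps above.
    public static string Combine(params string[] codes)
    {
        if (codes == null || codes.Length == 0)
        {
            throw new ArgumentException("At least one language code is required.", nameof(codes));
        }

        // Trim stray whitespace so " fra" and "fra" produce the same result.
        return string.Join("+", codes.Select(code => code.Trim()));
    }
}
```

For example, OcrLanguages.Combine("eng", "fra") returns eng+fra, which can then be passed as the first argument of OcrPage.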
The example below converts a PDF file to a searchable PDF:
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Determine the number of pages.
int pageCount = gdpicturePDF.GetPageCount();
// Loop through the pages of the source document.
for (int i = 1; i <= pageCount; i++)
{
    // Select a page and run the OCR process on it.
    gdpicturePDF.SelectPage(i);
    gdpicturePDF.OcrPage("eng", @"C:\GdPicture.NET 14\Redist\OCR", "", 300);
}
// Save the result in a new PDF document.
gdpicturePDF.SaveToFile(@"C:\temp\output.pdf");
gdpicturePDF.CloseDocument();

Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Determine the number of pages.
    Dim pageCount As Integer = gdpicturePDF.GetPageCount()
    ' Loop through the pages of the source document.
    For i = 1 To pageCount
        ' Select a page and run the OCR process on it.
        gdpicturePDF.SelectPage(i)
        gdpicturePDF.OcrPage("eng", "C:\GdPicture.NET 14\Redist\OCR", "", 300)
    Next
    ' Save the result in a new PDF document.
    gdpicturePDF.SaveToFile("C:\temp\output.pdf")
    gdpicturePDF.CloseDocument()
End Using
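
The example above ignores the values returned by each call. The variant below is a sketch of how the same steps might check for failures. It assumes that the SDK is exposed through the GdPicture14 namespace and that LoadFromFile, OcrPage, and SaveToFile return a GdPictureStatus value that equals GdPictureStatus.OK on success. Verify both assumptions against the API reference for your version before relying on them.

```csharp
using System;
using GdPicture14; // Assumed namespace for GdPicture.NET 14.

using GdPicturePDF pdf = new GdPicturePDF();

// Assumption: LoadFromFile returns a GdPictureStatus value.
GdPictureStatus status = pdf.LoadFromFile(@"C:\temp\source.pdf");
if (status != GdPictureStatus.OK)
{
    Console.WriteLine($"Loading failed: {status}");
    return;
}

int pageCount = pdf.GetPageCount();
for (int i = 1; i <= pageCount; i++)
{
    // Select the page, then stop at the first page whose OCR run fails.
    pdf.SelectPage(i);
    status = pdf.OcrPage("eng", @"C:\GdPicture.NET 14\Redist\OCR", "", 300);
    if (status != GdPictureStatus.OK)
    {
        Console.WriteLine($"OCR failed on page {i}: {status}");
        return;
    }
}

// Save the result and report whether saving succeeded.
status = pdf.SaveToFile(@"C:\temp\output.pdf");
Console.WriteLine(status == GdPictureStatus.OK ? "Saved." : $"Saving failed: {status}");
pdf.CloseDocument();
```

Stopping at the first failed page keeps the sketch short. Depending on the use case, logging the failure and continuing with the remaining pages may be a better fit.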