C# OCR Image to Text
This guide explains how to convert image files to searchable PDFs. GdPicture.NET’s optical character recognition (OCR) engine allows you to recognize the text in an image file and then save the text in a PDF.
Converting Image Files to Searchable PDFs
This section explains how to convert simple, single-page image files. For more information on converting multipage image files, see Converting Multipage TIFF Files to Searchable PDFs.
To convert an image file to a searchable PDF, follow these steps.
- Create a GdPicturePDF object, a GdPictureImaging object, and a GdPictureOCR object.
- Select the image by passing its path to the CreateGdPictureImageFromFile method of the GdPictureImaging object.
- Configure the OCR process with the GdPictureOCR object in the following way:
  - Set the image with the SetImage method.
  - Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.
  - With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the OCRMode property.
  - Optional: Set the character allowlist with the CharacterSet property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.
  - Optional: Set the character denylist with the CharacterBlackList property. When scanning the image, the OCR engine ignores the characters included in the denylist.
- Run the OCR process with the RunOCR method of the GdPictureOCR object.
- Get the result of the OCR process as text with the GetOCRResultText method of the GdPictureOCR object.
- Create the output with the CreateFromText method. The first parameter sets the conformance level of the PDF document. This parameter is a member of the PdfConformance enumeration. For example, use PdfConformance.PDF to create a common PDF document.
- Save the output in a PDF document.
The example below converts an image file to a searchable PDF by specifying the language of the text:
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
// Select the image to process.
int imageID = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
// Set the OCR parameters.
gdpictureOCR.SetImage(imageID);
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the OCR process.
string resID = gdpictureOCR.RunOCR();
// Get the result of the OCR process as text.
string content = gdpictureOCR.GetOCRResultText(resID);
// Save the result in a PDF document.
gdpicturePDF.CreateFromText(PdfConformance.PDF, 595, 842, 10, 10, 10, 10,
    TextAlignment.TextAlignmentNear, content, 12, "Arial", false, false, true, false);
gdpicturePDF.SaveToFile(@"C:\temp\output.pdf");
gdpictureImaging.ReleaseGdPictureImage(imageID);

Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
        Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
            ' Select the image to process.
            Dim imageID As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
            ' Set the OCR parameters.
            gdpictureOCR.SetImage(imageID)
            gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
            gdpictureOCR.AddLanguage(OCRLanguage.English)
            ' Run the OCR process.
            Dim resID As String = gdpictureOCR.RunOCR()
            ' Get the result of the OCR process as text.
            Dim content As String = gdpictureOCR.GetOCRResultText(resID)
            ' Save the result in a PDF document.
            gdpicturePDF.CreateFromText(PdfConformance.PDF, 595, 842, 10, 10, 10, 10,
                TextAlignment.TextAlignmentNear, content, 12, "Arial", False, False, True, False)
            gdpicturePDF.SaveToFile("C:\temp\output.pdf")
            gdpictureImaging.ReleaseGdPictureImage(imageID)
        End Using
    End Using
End Using
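
The values 595 and 842 passed to CreateFromText in the example above match the dimensions of an A4 page expressed in PDF points (1 point = 1/72 inch). That reading is an inference from the numbers, not something this guide states, so treat the following as a minimal sketch for deriving page sizes under that assumption. The class and method names (PageSizeSketch, MmToPoints) are hypothetical and not part of GdPicture.NET.

```csharp
using System;

public static class PageSizeSketch
{
    // Converts millimeters to whole PDF points.
    // One inch is 25.4 mm, and one point is 1/72 inch.
    public static int MmToPoints(double millimeters) =>
        (int)Math.Round(millimeters / 25.4 * 72.0);

    public static void Main()
    {
        // A4 is 210 mm x 297 mm.
        Console.WriteLine($"{MmToPoints(210)} x {MmToPoints(297)}");     // prints "595 x 842"

        // US Letter is 8.5 in x 11 in, that is 215.9 mm x 279.4 mm.
        Console.WriteLine($"{MmToPoints(215.9)} x {MmToPoints(279.4)}"); // prints "612 x 792"
    }
}
```

If the assumption holds, passing 612 and 792 instead of 595 and 842 would produce a US Letter page.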
The example below converts an image file to a searchable PDF. It specifies two languages, favors speed over accuracy, and disregards numbers when scanning the image:
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
// Select the image to process.
int imageID = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
// Set the OCR parameters.
gdpictureOCR.SetImage(imageID);
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
gdpictureOCR.AddLanguage(OCRLanguage.German);
gdpictureOCR.OCRMode = OCRMode.FavorSpeed;
gdpictureOCR.CharacterBlackList = "0123456789";
// Run the OCR process.
string resID = gdpictureOCR.RunOCR();
// Get the result of the OCR process as text.
string content = gdpictureOCR.GetOCRResultText(resID);
// Save the result in a PDF document.
gdpicturePDF.CreateFromText(PdfConformance.PDF, 595, 842, 10, 10, 10, 10,
    TextAlignment.TextAlignmentNear, content, 12, "Arial", false, false, true, false);
gdpicturePDF.SaveToFile(@"C:\temp\output.pdf");
gdpictureImaging.ReleaseGdPictureImage(imageID);

Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
        Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
            ' Select the image to process.
            Dim imageID As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
            ' Set the OCR parameters.
            gdpictureOCR.SetImage(imageID)
            gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
            gdpictureOCR.AddLanguage(OCRLanguage.English)
            gdpictureOCR.AddLanguage(OCRLanguage.German)
            gdpictureOCR.OCRMode = OCRMode.FavorSpeed
            gdpictureOCR.CharacterBlackList = "0123456789"
            ' Run the OCR process.
            Dim resID As String = gdpictureOCR.RunOCR()
            ' Get the result of the OCR process as text.
            Dim content As String = gdpictureOCR.GetOCRResultText(resID)
            ' Save the result in a PDF document.
            gdpicturePDF.CreateFromText(PdfConformance.PDF, 595, 842, 10, 10, 10, 10,
                TextAlignment.TextAlignmentNear, content, 12, "Arial", False, False, True, False)
            gdpicturePDF.SaveToFile("C:\temp\output.pdf")
            gdpictureImaging.ReleaseGdPictureImage(imageID)
        End Using
    End Using
End Using
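
The steps above describe the CharacterSet property as an allowlist, but neither example sets it. The sketch below shows one way to build an allowlist string from character ranges, for instance to restrict recognition to digits and uppercase letters. The AllowlistSketch class and its Range helper are hypothetical names introduced for illustration. The assignment to CharacterSet appears only as a comment because it needs the GdPictureOCR object from the examples above.

```csharp
using System;
using System.Linq;

public static class AllowlistSketch
{
    // Builds a string containing every character from 'first' to 'last', inclusive.
    public static string Range(char first, char last) =>
        new string(Enumerable.Range(first, last - first + 1)
                             .Select(code => (char)code)
                             .ToArray());

    public static void Main()
    {
        // Digits followed by uppercase Latin letters.
        string allowlist = Range('0', '9') + Range('A', 'Z');
        Console.WriteLine(allowlist); // prints "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

        // With the GdPictureOCR object from the examples above (assumed usage):
        // gdpictureOCR.CharacterSet = allowlist;
    }
}
```

Building the string from ranges avoids typing long character lists by hand, which makes it harder to leave a character out by accident.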
Converting Multipage TIFF Files to Searchable PDFs
To convert a multipage TIFF file to a searchable PDF, follow these steps.
- Create a GdPicturePDF object, a GdPictureImaging object, and a GdPictureOCR object.
- Select the image by passing its path to the TiffCreateMultiPageFromFile method of the GdPictureImaging object.
- Determine the number of pages with the GetPageCount method of the GdPictureImaging object.
- Configure the OCR process with the GdPictureOCR object in the following way:
  - Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.
  - With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the OCRMode property.
  - Optional: Set the character allowlist with the CharacterSet property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.
  - Optional: Set the character denylist with the CharacterBlackList property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.
- Create the resulting PDF document with the NewPDF method of the GdPicturePDF object. The parameter of this method sets the conformance level of the PDF document. This parameter is a member of the PdfConformance enumeration. For example, use PdfConformance.PDF to create a common PDF document.
- Loop through pages of the image file.
- For each page, run the OCR process with the RunOCR method of the GdPictureOCR object, and then add the result to a new page in the PDF.
- Save the output in a PDF document.
The example below converts a multipage TIFF file to a searchable PDF:
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
// Select the image to process.
int imageID = gdpictureImaging.TiffCreateMultiPageFromFile(@"C:\temp\source.tif");
// Determine the number of pages.
int pageCount = gdpictureImaging.GetPageCount(imageID);
// Set the OCR parameters.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Create the resulting PDF document.
gdpicturePDF.NewPDF(PdfConformance.PDF);
gdpicturePDF.SetOrigin(PdfOrigin.PdfOriginTopLeft);
string fontResName = gdpicturePDF.AddStandardFont(PdfStandardFont.PdfStandardFontCourier);
// Loop through pages of the image file.
string resID = "page";
string content = null;
for (int i = 1; i <= pageCount; i++)
{
    // Select the current page and set up the image for OCR.
    gdpictureImaging.TiffSelectPage(imageID, i);
    gdpictureOCR.SetImage(imageID);
    // Run the OCR process on the current page.
    gdpictureOCR.RunOCR(resID);
    // Get the result.
    content = gdpictureOCR.GetOCRResultText(resID);
    // Add the result to a new page in the PDF.
    gdpicturePDF.NewPage(PdfPageSizes.PdfPageSizeA4);
    gdpicturePDF.DrawText(fontResName, 10, 10, content);
    // Release the previous OCR result. This improves memory management.
    gdpictureOCR.ReleaseOCRResult(resID);
}
// Save the resulting PDF document.
gdpicturePDF.SaveToFile(@"C:\temp\output.pdf");
gdpicturePDF.CloseDocument();
gdpictureImaging.ReleaseGdPictureImage(imageID);

Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
        Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
            ' Select the image to process.
            Dim imageID As Integer = gdpictureImaging.TiffCreateMultiPageFromFile("C:\temp\source.tif")
            ' Determine the number of pages.
            Dim pageCount As Integer = gdpictureImaging.GetPageCount(imageID)
            ' Set the OCR parameters.
            gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
            gdpictureOCR.AddLanguage(OCRLanguage.English)
            ' Create the resulting PDF document.
            gdpicturePDF.NewPDF(PdfConformance.PDF)
            gdpicturePDF.SetOrigin(PdfOrigin.PdfOriginTopLeft)
            Dim fontResName As String = gdpicturePDF.AddStandardFont(PdfStandardFont.PdfStandardFontCourier)
            ' Loop through pages of the image file.
            Dim resID As String = "page"
            Dim content As String = Nothing
            For i As Integer = 1 To pageCount
                ' Select the current page and set up the image for OCR.
                gdpictureImaging.TiffSelectPage(imageID, i)
                gdpictureOCR.SetImage(imageID)
                ' Run the OCR process on the current page.
                gdpictureOCR.RunOCR(resID)
                ' Get the result.
                content = gdpictureOCR.GetOCRResultText(resID)
                ' Add the result to a new page in the PDF.
                gdpicturePDF.NewPage(PdfPageSizes.PdfPageSizeA4)
                gdpicturePDF.DrawText(fontResName, 10, 10, content)
                ' Release the previous OCR result. This improves memory management.
                gdpictureOCR.ReleaseOCRResult(resID)
            Next
            ' Save the resulting PDF document.
            gdpicturePDF.SaveToFile("C:\temp\output.pdf")
            gdpicturePDF.CloseDocument()
            gdpictureImaging.ReleaseGdPictureImage(imageID)
        End Using
    End Using
End Using
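
The examples above do not check whether loading the file or running OCR succeeded. The sketch below adds such checks to the page loop. It relies on two assumptions that do not come from this guide: that the types live in the GdPicture14 namespace, and that each object exposes a GetStat method returning a GdPictureStatus value for its most recent operation. Verify both against the API reference for your version. Like the examples above, it also assumes the SDK license is already registered.

```csharp
using System;
using GdPicture14; // Assumption: namespace used by GdPicture.NET 14.

using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();

// Load the multipage TIFF and stop early if loading failed.
int imageID = gdpictureImaging.TiffCreateMultiPageFromFile(@"C:\temp\source.tif");
if (gdpictureImaging.GetStat() != GdPictureStatus.OK) // Assumption: GetStat reports the last status.
{
    Console.WriteLine($"Loading failed: {gdpictureImaging.GetStat()}");
    return;
}

gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);

int pageCount = gdpictureImaging.GetPageCount(imageID);
for (int i = 1; i <= pageCount; i++)
{
    gdpictureImaging.TiffSelectPage(imageID, i);
    gdpictureOCR.SetImage(imageID);

    // Run OCR and skip the page if recognition failed.
    string resID = gdpictureOCR.RunOCR();
    if (gdpictureOCR.GetStat() != GdPictureStatus.OK)
    {
        Console.WriteLine($"OCR failed on page {i}: {gdpictureOCR.GetStat()}");
        continue;
    }

    string content = gdpictureOCR.GetOCRResultText(resID);
    Console.WriteLine($"Page {i}: {content.Length} characters recognized.");
    gdpictureOCR.ReleaseOCRResult(resID);
}

gdpictureImaging.ReleaseGdPictureImage(imageID);
```

Skipping a failed page, rather than aborting, lets the remaining pages still be processed. Whether that is the right choice depends on how complete the output PDF needs to be.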