Scan and OCR PDFs in C#
This guide explains how to scan a physical document with a scanner and then save the scanned image in a searchable PDF. GdPicture.NET’s optical character recognition (OCR) engine allows you to recognize text in an image and then save the text in a PDF. This guide uses the TWAIN protocol.
Printing and scanning aren’t supported in the cross-platform .NET 6.0 assembly. For more information, see the system compatibility guide.
To get an image from a scanner and then save it in a searchable PDF, follow these steps.
- Create a GdPictureImaging object and a GdPicturePDF object.
- Store the window handle in a variable. The example uses the IntPtr.Zero field, which represents a null handle.
- Select the scanner by passing the handle to the TwainSelectSource and TwainOpenDefaultSource methods of the GdPictureImaging object.
- Optional: Hide the scanning user interface with the TwainSetHideUI method of the GdPictureImaging object. Use this setting when your application cannot communicate with the scanner.
- Create a new PDF document with the NewPDF method of the GdPicturePDF object. The parameter of this method sets the conformance level of the PDF document and is a member of the PdfConformance enumeration. For example, use PDF to create a common PDF document.
- Get the image from the scanner by passing the handle to the TwainAcquireToGdPictureImage method of the GdPictureImaging object.
- Add the scanned image to a new page in the destination document with the AddImageFromGdPictureImage method of the GdPicturePDF object.
- Run the OCR process with the OcrPage method of the GdPicturePDF object, passing the following parameters in order:
  - The code of the language that GdPicture.NET uses to recognize text in the source document. To specify several languages, separate the language codes with the + character. For example, eng+fra.
  - The path to the OCR resource folder. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.
  - The character allowlist. The OCR engine only recognizes the characters included in the allowlist. When you set "", all characters are recognized.
  - The dots-per-inch (DPI) resolution the OCR engine uses. It’s recommended to use 300 for the best combination of speed and accuracy.
- Save the result in a PDF document with the SaveToFile method of the GdPicturePDF object.
- Release the scanned image and close the TWAIN source.
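The OCR parameters described in the steps above can be prepared and checked before the OCR call. The following is a minimal sketch under stated assumptions: OcrArguments, JoinLanguages, and RequireResourceFolder are hypothetical names introduced for illustration and are not part of GdPicture.NET. It joins language codes with the + character and verifies that the resource folder exists.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not part of the GdPicture.NET API.
public static class OcrArguments
{
    // Joins language codes with '+', so "eng" and "fra" become "eng+fra".
    public static string JoinLanguages(IEnumerable<string> codes)
    {
        if (codes == null) throw new ArgumentNullException(nameof(codes));

        // Drop empty entries and surrounding whitespace before joining.
        string[] cleaned = codes
            .Select(code => code?.Trim())
            .Where(code => !string.IsNullOrEmpty(code))
            .ToArray();

        if (cleaned.Length == 0)
            throw new ArgumentException("At least one language code is required.", nameof(codes));

        return string.Join("+", cleaned);
    }

    // Throws if the OCR resource folder is missing, so the problem
    // surfaces before the OCR call rather than during it.
    public static string RequireResourceFolder(string path)
    {
        if (!Directory.Exists(path))
            throw new DirectoryNotFoundException($"OCR resource folder not found: {path}");
        return path;
    }
}
```

The two return values could then be passed as the language code and resource path arguments of the OCR call, with "" as the allowlist and 300 as the resolution.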
The example below gets an image from a scanner and then saves it in a searchable PDF:
C#:

    using GdPictureImaging gdpictureImaging = new GdPictureImaging();
    using GdPicturePDF gdpicturePDF = new GdPicturePDF();
    // Store the window handle in a variable (IntPtr.Zero is a null handle).
    IntPtr WINDOW_HANDLE = IntPtr.Zero;
    // Select the scanner.
    gdpictureImaging.TwainSelectSource(WINDOW_HANDLE);
    gdpictureImaging.TwainOpenDefaultSource(WINDOW_HANDLE);
    // (Optional) Hide the scanning user interface.
    gdpictureImaging.TwainSetHideUI(true);
    // Create the destination PDF document.
    gdpicturePDF.NewPDF(PdfConformance.PDF);
    // Get the image from the scanner.
    int imageID = gdpictureImaging.TwainAcquireToGdPictureImage(WINDOW_HANDLE);
    // Add the scanned image to a new page in the destination document.
    gdpicturePDF.AddImageFromGdPictureImage(imageID, false, true);
    // Run the OCR process.
    gdpicturePDF.OcrPage("eng", @"C:\GdPicture.NET 14\Redist\OCR", "", 300);
    // Save the result in a PDF document.
    gdpicturePDF.SaveToFile(@"C:\temp\output.pdf");
    // Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageID);
    gdpictureImaging.TwainCloseSource();
VB.NET:

    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
    Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
        ' Store the window handle in a variable (IntPtr.Zero is a null handle).
        Dim WINDOW_HANDLE = IntPtr.Zero
        ' Select the scanner.
        gdpictureImaging.TwainSelectSource(WINDOW_HANDLE)
        gdpictureImaging.TwainOpenDefaultSource(WINDOW_HANDLE)
        ' (Optional) Hide the scanning user interface.
        gdpictureImaging.TwainSetHideUI(True)
        ' Create the destination PDF document.
        gdpicturePDF.NewPDF(PdfConformance.PDF)
        ' Get the image from the scanner.
        Dim imageID As Integer = gdpictureImaging.TwainAcquireToGdPictureImage(WINDOW_HANDLE)
        ' Add the scanned image to a new page in the destination document.
        gdpicturePDF.AddImageFromGdPictureImage(imageID, False, True)
        ' Run the OCR process.
        gdpicturePDF.OcrPage("eng", "C:\GdPicture.NET 14\Redist\OCR", "", 300)
        ' Save the result in a PDF document.
        gdpicturePDF.SaveToFile("C:\temp\output.pdf")
        ' Release unnecessary resources.
        gdpictureImaging.ReleaseGdPictureImage(imageID)
        gdpictureImaging.TwainCloseSource()
    End Using
    End Using
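The example above ignores the return values of each call, so a failed step (for instance, no scanner selected) goes unnoticed until a later step fails. The following is a minimal sketch of one way to surface failures early and still run cleanup. ScanSteps, Require, and RunWithCleanup are hypothetical names introduced for illustration and are not part of GdPicture.NET.

```csharp
using System;

// Hypothetical helper for illustration; not part of the GdPicture.NET API.
public static class ScanSteps
{
    // Throws when a step reports failure, naming the step that failed.
    public static void Require(bool succeeded, string step)
    {
        if (!succeeded)
            throw new InvalidOperationException($"Scan workflow failed at step: {step}");
    }

    // Runs the workflow and always runs the cleanup action afterward,
    // even when one of the workflow steps throws.
    public static void RunWithCleanup(Action workflow, Action cleanup)
    {
        if (workflow == null) throw new ArgumentNullException(nameof(workflow));
        if (cleanup == null) throw new ArgumentNullException(nameof(cleanup));

        try
        {
            workflow();
        }
        finally
        {
            cleanup();
        }
    }
}
```

As an assumption to confirm against the API reference: if TwainOpenDefaultSource returns a Boolean and TwainAcquireToGdPictureImage returns 0 on failure, the calls could be wrapped as ScanSteps.Require(gdpictureImaging.TwainOpenDefaultSource(WINDOW_HANDLE), "open source") and ScanSteps.Require(imageID != 0, "acquire image"), with the TwainCloseSource call placed in the cleanup action.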