Read Text from PDFs and Images Using C#
This guide explains how to read text from a PDF or image file. Sometimes, text is stored in a PDF or an image in a way that prevents you from searching or copying it. GdPicture.NET’s optical character recognition (OCR) engine recognizes this text and lets you save it in a separate file, where you can search, copy, and paste it.
Reading Text from a PDF
To read text from a PDF, follow these steps:
- Create a `GdPicturePDF` object and a `GdPictureOCR` object.
- Select the source document by passing its path to the `LoadFromFile` method of the `GdPicturePDF` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the `OCRMode` property.
  - Optional: Set the character allowlist with the `CharacterSet` property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.
  - Optional: Set the character denylist with the `CharacterBlackList` property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.
- Create an empty string where you’ll save the output.
- Determine the number of pages with the `GetPageCount` method of the `GdPicturePDF` object and loop through them. For each page:
  - Select the page with the `SelectPage` method of the `GdPicturePDF` object.
  - Render the page to a 300 dots-per-inch (DPI) image with the `RenderPageToGdPictureImageEx` method of the `GdPicturePDF` object.
  - Pass the image to the `GdPictureOCR` object with its `SetImage` method.
  - Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
  - Get the result of the OCR process as text with the `GetOCRResultText` method of the `GdPictureOCR` object, and append it to the output string.
  - Release the image with the `DisposeImage` method of the `GdPictureDocumentUtilities` class, and release the OCR result with the `ReleaseOCRResult` method of the `GdPictureOCR` object.
- After reading all the pages, save the output in a new text file with the standard `System.IO.StreamWriter` class.
- Release unnecessary resources.
The example below reads text from a PDF and saves the output in a TXT file:
C#

    using GdPicture14;

    using GdPicturePDF gdpicturePDF = new GdPicturePDF();
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    // Select the source document.
    gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
    // Configure the OCR process.
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Create an empty string where you'll save the output.
    string outputText = "";
    // Determine the number of pages and loop through them.
    int pageCount = gdpicturePDF.GetPageCount();
    for (int page = 1; page <= pageCount; page++)
    {
        gdpicturePDF.SelectPage(page);
        // Render the page to a 300 DPI image.
        int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
        // Pass the image to the `GdPictureOCR` object.
        gdpictureOCR.SetImage(imageId);
        // Run the OCR process.
        string resultId = gdpictureOCR.RunOCR();
        // Get the result of the OCR process as text.
        outputText += gdpictureOCR.GetOCRResultText(resultId);
        // Release the image and the OCR result.
        GdPictureDocumentUtilities.DisposeImage(imageId);
        gdpictureOCR.ReleaseOCRResult(resultId);
    }
    // Save the output in a new text file.
    System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
    outputFile.WriteLine(outputText);
    outputFile.Close();
    // Release unnecessary resources.
    gdpicturePDF.CloseDocument();

VB.NET

    ' Assumes Imports GdPicture14 at the top of the file.
    Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
        Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
            ' Select the source document.
            gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
            ' Configure the OCR process.
            gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
            gdpictureOCR.AddLanguage(OCRLanguage.English)
            ' Create an empty string where you'll save the output.
            Dim outputText = ""
            ' Determine the number of pages and loop through them.
            Dim pageCount As Integer = gdpicturePDF.GetPageCount()
            For page = 1 To pageCount
                gdpicturePDF.SelectPage(page)
                ' Render the page to a 300 DPI image.
                Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
                ' Pass the image to the `GdPictureOCR` object.
                gdpictureOCR.SetImage(imageId)
                ' Run the OCR process.
                Dim resultId As String = gdpictureOCR.RunOCR()
                ' Get the result of the OCR process as text.
                outputText += gdpictureOCR.GetOCRResultText(resultId)
                ' Release the image and the OCR result.
                GdPictureDocumentUtilities.DisposeImage(imageId)
                gdpictureOCR.ReleaseOCRResult(resultId)
            Next
            ' Save the output in a new text file.
            Dim outputFile As System.IO.StreamWriter = New System.IO.StreamWriter("C:\temp\output.txt")
            outputFile.WriteLine(outputText)
            outputFile.Close()
            ' Release unnecessary resources.
            gdpicturePDF.CloseDocument()
        End Using
    End Using
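The PDF example leaves out the optional settings listed in the configuration step. The fragment below is a minimal sketch of how they might be set. It assumes the `GdPicture14` namespace and that the `OCRMode` enumeration exposes a `FavorAccuracy` member, so check the API reference for your version. The digit-only allowlist and the denylisted characters are illustrative values, not recommendations. The fragment needs the GdPicture.NET SDK and its OCR resources to run.

```csharp
using GdPicture14;

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);

// Assumption: OCRMode.FavorAccuracy asks the engine to prefer accuracy over speed.
gdpictureOCR.OCRMode = OCRMode.FavorAccuracy;

// Allowlist (illustrative): recognize only digits, for example on an invoice number field.
gdpictureOCR.CharacterSet = "0123456789";

// Alternative to an allowlist (illustrative): exclude characters that are often misread.
// gdpictureOCR.CharacterBlackList = "|~";
```

An allowlist and a denylist address the same problem from opposite directions, so in most cases setting one of them is likely enough.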
Reading Text from an Image
This section explains how to read text from simple, single-page image files. For more information on reading multipage image files, see Reading Text from Multipage TIFF Files.
To read text from an image file, follow these steps:
- Create a `GdPictureImaging` object and a `GdPictureOCR` object.
- Select the image by passing its path to the `CreateGdPictureImageFromFile` method of the `GdPictureImaging` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the image with the `SetImage` method.
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the `OCRMode` property.
  - Optional: Set the character allowlist with the `CharacterSet` property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.
  - Optional: Set the character denylist with the `CharacterBlackList` property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the result of the OCR process as text with the `GetOCRResultText` method of the `GdPictureOCR` object.
- Save the output in a new text file with the standard `System.IO.StreamWriter` class.
- Release unnecessary resources.
The example below reads text from an image file and saves the output in a TXT file:
C#

    using GdPicture14;

    using GdPictureImaging gdpictureImaging = new GdPictureImaging();
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    // Select the image to read.
    int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
    // Configure the OCR parameters.
    gdpictureOCR.SetImage(imageId);
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Run the OCR process.
    string resultId = gdpictureOCR.RunOCR();
    // Get the result of the OCR process as text.
    string outputText = gdpictureOCR.GetOCRResultText(resultId);
    // Save the output in a new text file.
    System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
    outputFile.WriteLine(outputText);
    outputFile.Close();
    // Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId);

VB.NET

    ' Assumes Imports GdPicture14 at the top of the file.
    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
        Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
            ' Select the image to read.
            Dim imageId As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
            ' Configure the OCR parameters.
            gdpictureOCR.SetImage(imageId)
            gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
            gdpictureOCR.AddLanguage(OCRLanguage.English)
            ' Run the OCR process.
            Dim resultId As String = gdpictureOCR.RunOCR()
            ' Get the result of the OCR process as text.
            Dim outputText As String = gdpictureOCR.GetOCRResultText(resultId)
            ' Save the output in a new text file.
            Dim outputFile As System.IO.StreamWriter = New System.IO.StreamWriter("C:\temp\output.txt")
            outputFile.WriteLine(outputText)
            outputFile.Close()
            ' Release unnecessary resources.
            gdpictureImaging.ReleaseGdPictureImage(imageId)
        End Using
    End Using
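The image example assumes that loading the file and running OCR both succeed. The sketch below shows one way to check for failures before reading the result. It assumes the `GdPicture14` namespace and that the `GetStat` method of the `GdPictureImaging` and `GdPictureOCR` objects reports the outcome of the previous call as a `GdPictureStatus` value. Treat these names as assumptions to confirm against the API reference. Printing to the console stands in for whatever error handling your application uses, and the fragment needs the GdPicture.NET SDK to run.

```csharp
using System;
using GdPicture14;

using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();

int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
// Assumption: GetStat returns the status of the most recent call on this object.
if (gdpictureImaging.GetStat() != GdPictureStatus.OK)
{
    Console.WriteLine($"Could not load the image: {gdpictureImaging.GetStat()}");
    return;
}

gdpictureOCR.SetImage(imageId);
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);

string resultId = gdpictureOCR.RunOCR();
if (gdpictureOCR.GetStat() == GdPictureStatus.OK)
{
    Console.WriteLine(gdpictureOCR.GetOCRResultText(resultId));
}
else
{
    Console.WriteLine($"OCR failed: {gdpictureOCR.GetStat()}");
}

// Release the image whether or not OCR succeeded.
gdpictureImaging.ReleaseGdPictureImage(imageId);
```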
Reading Text from Multipage TIFF Files
To read text from a multipage TIFF file, follow these steps:
- Create a `GdPictureImaging` object and a `GdPictureOCR` object.
- Select the image by passing its path to the `TiffCreateMultiPageFromFile` method of the `GdPictureImaging` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the `OCRMode` property.
  - Optional: Set the character allowlist with the `CharacterSet` property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.
  - Optional: Set the character denylist with the `CharacterBlackList` property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.
- Create an empty string where you’ll save the output.
- Determine the number of pages with the `GetPageCount` method of the `GdPictureImaging` object and loop through them. For each page:
  - Select the page with the `TiffSelectPage` method of the `GdPictureImaging` object.
  - Pass the page to the `GdPictureOCR` object with its `SetImage` method.
  - Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
  - Get the result of the OCR process as text with the `GetOCRResultText` method of the `GdPictureOCR` object, and append it to the output string.
  - Release the OCR result with the `ReleaseOCRResult` method of the `GdPictureOCR` object.
- After reading all the pages, save the output in a new text file with the standard `System.IO.StreamWriter` class.
- Release unnecessary resources.
The example below reads text from a multipage TIFF file and saves the output in a TXT file:
C#

    using GdPicture14;

    using GdPictureImaging gdpictureImaging = new GdPictureImaging();
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    // Select the image to read.
    int imageId = gdpictureImaging.TiffCreateMultiPageFromFile(@"C:\temp\source.tif");
    // Configure the OCR parameters.
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Create an empty string where you'll save the output.
    string outputText = "";
    // Determine the number of pages and loop through them.
    int pageCount = gdpictureImaging.GetPageCount(imageId);
    for (int page = 1; page <= pageCount; page++)
    {
        // Select a page and pass it to the `GdPictureOCR` object.
        gdpictureImaging.TiffSelectPage(imageId, page);
        gdpictureOCR.SetImage(imageId);
        // Run the OCR process.
        string resultId = gdpictureOCR.RunOCR();
        // Get the result of the OCR process as text.
        outputText += gdpictureOCR.GetOCRResultText(resultId);
        // Release the OCR result.
        gdpictureOCR.ReleaseOCRResult(resultId);
    }
    // Save the output in a new text file.
    System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
    outputFile.WriteLine(outputText);
    outputFile.Close();
    // Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId);

VB.NET

    ' Assumes Imports GdPicture14 at the top of the file.
    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
        Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
            ' Select the image to read.
            Dim imageId As Integer = gdpictureImaging.TiffCreateMultiPageFromFile("C:\temp\source.tif")
            ' Configure the OCR parameters.
            gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
            gdpictureOCR.AddLanguage(OCRLanguage.English)
            ' Create an empty string where you'll save the output.
            Dim outputText = ""
            ' Determine the number of pages and loop through them.
            Dim pageCount As Integer = gdpictureImaging.GetPageCount(imageId)
            For page = 1 To pageCount
                ' Select a page and pass it to the `GdPictureOCR` object.
                gdpictureImaging.TiffSelectPage(imageId, page)
                gdpictureOCR.SetImage(imageId)
                ' Run the OCR process.
                Dim resultId As String = gdpictureOCR.RunOCR()
                ' Get the result of the OCR process as text.
                outputText += gdpictureOCR.GetOCRResultText(resultId)
                ' Release the OCR result.
                gdpictureOCR.ReleaseOCRResult(resultId)
            Next
            ' Save the output in a new text file.
            Dim outputFile As System.IO.StreamWriter = New System.IO.StreamWriter("C:\temp\output.txt")
            outputFile.WriteLine(outputText)
            outputFile.Close()
            ' Release unnecessary resources.
            gdpictureImaging.ReleaseGdPictureImage(imageId)
        End Using
    End Using
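The multipage examples append each page's text with `+=` and no separator, so the end of one page may run directly into the start of the next. The sketch below shows one possible alternative that uses only standard .NET classes: it collects the per-page strings, joins them with a separator, and writes the file in a single call. The `OcrOutput` class, the `Combine` helper, the sample page strings, and the temporary output path are all hypothetical names and values introduced for illustration. In a real loop, each string would come from `GetOCRResultText`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class OcrOutput
{
    // Joins per-page text, placing the separator between pages only.
    public static string Combine(IEnumerable<string> pageTexts, string separator)
    {
        var builder = new StringBuilder();
        bool first = true;
        foreach (string pageText in pageTexts)
        {
            if (!first)
            {
                builder.Append(separator);
            }
            builder.Append(pageText);
            first = false;
        }
        return builder.ToString();
    }

    public static void Main()
    {
        // Stand-in values for the text that OCR would return for each page.
        var pages = new List<string> { "Text of page 1", "Text of page 2" };

        // Hypothetical output location in the system's temporary folder.
        string outputPath = Path.Combine(Path.GetTempPath(), "output.txt");

        // File.WriteAllText creates or overwrites the file and closes it afterward.
        File.WriteAllText(outputPath, Combine(pages, Environment.NewLine));
        Console.WriteLine($"Wrote {pages.Count} pages to {outputPath}");
    }
}
```

A `StringBuilder` avoids allocating a new string on every iteration, which starts to matter for documents with many pages. For short documents, the `+=` approach in the examples works just as well.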