Read text from PDFs and images in C#
This guide explains how to read text from a PDF or image file. Sometimes a PDF or an image stores text in a form that you can’t search or copy. The optical character recognition (OCR) engine in Nutrient .NET SDK (formerly GdPicture.NET) recognizes this text so that you can save it in a separate file, where you can search, copy, and paste it.
Reading text from a PDF
To read text from a PDF, follow the steps below:
- Create a `GdPicturePDF` object and a `GdPictureOCR` object.
- Select the source document by passing its path to the `LoadFromFile` method of the `GdPicturePDF` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the `OCRMode` property.
  - Optional: Set the character allowlist with the `CharacterSet` property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.
  - Optional: Set the character denylist with the `CharacterBlackList` property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.
- Create an empty string where you’ll save the output.
- Determine the number of pages with the `GetPageCount` method of the `GdPicturePDF` object and loop through them.
- Select each page with the `SelectPage` method and render it to a 300 dots-per-inch (DPI) image with the `RenderPageToGdPictureImageEx` method of the `GdPicturePDF` object.
- Pass the image to the `GdPictureOCR` object with the `SetImage` method of the `GdPictureOCR` object.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the result of the OCR process as text with the `GetOCRResultText` method of the `GdPictureOCR` object, and save it in the output string.
- Release the image with the `DisposeImage` method of the `GdPictureDocumentUtilities` class, and release the OCR result with the `ReleaseOCRResult` method of the `GdPictureOCR` object.
- After reading all the pages, save the output in a new text file with the standard `System.IO.StreamWriter` class.
- Release unnecessary resources.
The example below reads text from a PDF and saves the output in a TXT file:
C#

    using GdPicture14;

    using GdPicturePDF gdpicturePDF = new GdPicturePDF();
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    // Select the source document.
    gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
    // Configure the OCR process.
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Create an empty string where you'll save the output.
    string outputText = "";
    // Determine the number of pages and loop through them.
    int pageCount = gdpicturePDF.GetPageCount();
    for (int page = 1; page <= pageCount; page++)
    {
        gdpicturePDF.SelectPage(page);
        // Render the page to a 300 DPI image.
        int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
        // Pass the image to the `GdPictureOCR` object.
        gdpictureOCR.SetImage(imageId);
        // Run the OCR process.
        string resultId = gdpictureOCR.RunOCR();
        // Get the result of the OCR process as text.
        outputText += gdpictureOCR.GetOCRResultText(resultId);
        // Release the image and the OCR result.
        GdPictureDocumentUtilities.DisposeImage(imageId);
        gdpictureOCR.ReleaseOCRResult(resultId);
    }
    // Save the output in a new text file.
    System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
    outputFile.WriteLine(outputText);
    outputFile.Close();
    // Release unnecessary resources.
    gdpicturePDF.CloseDocument();

VB.NET

    Imports GdPicture14

    Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
        Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
            ' Select the source document.
            gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
            ' Configure the OCR process.
            gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
            gdpictureOCR.AddLanguage(OCRLanguage.English)
            ' Create an empty string where you'll save the output.
            Dim outputText = ""
            ' Determine the number of pages and loop through them.
            Dim pageCount As Integer = gdpicturePDF.GetPageCount()
            For page = 1 To pageCount
                gdpicturePDF.SelectPage(page)
                ' Render the page to a 300 DPI image.
                Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
                ' Pass the image to the `GdPictureOCR` object.
                gdpictureOCR.SetImage(imageId)
                ' Run the OCR process.
                Dim resultId As String = gdpictureOCR.RunOCR()
                ' Get the result of the OCR process as text.
                outputText += gdpictureOCR.GetOCRResultText(resultId)
                ' Release the image and the OCR result.
                GdPictureDocumentUtilities.DisposeImage(imageId)
                gdpictureOCR.ReleaseOCRResult(resultId)
            Next
            ' Save the output in a new text file.
            Dim outputFile As System.IO.StreamWriter = New System.IO.StreamWriter("C:\temp\output.txt")
            outputFile.WriteLine(outputText)
            outputFile.Close()
            ' Release unnecessary resources.
            gdpicturePDF.CloseDocument()
        End Using
    End Using
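The page loop above appends each page's text directly to a string. As a minimal sketch of one alternative, the helper below collects the text with a `StringBuilder`, puts each page on its own line, and writes the file with `File.WriteAllText`. It uses only standard .NET classes. The names `PageTextCollector`, `Collect`, `Save`, and `recognizePage` are hypothetical and aren't part of Nutrient .NET SDK.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the SDK.
public static class PageTextCollector
{
    // Calls recognizePage for pages 1..pageCount (1-based, like the loop above)
    // and joins the results, ending each page's text with a line break.
    public static string Collect(int pageCount, Func<int, string> recognizePage)
    {
        var output = new StringBuilder();
        for (int page = 1; page <= pageCount; page++)
        {
            output.AppendLine(recognizePage(page));
        }
        return output.ToString();
    }

    // Writes the text to a file. File.WriteAllText opens, writes, and closes
    // the file in one call, so there's no writer object left to close.
    public static void Save(string path, string text)
    {
        File.WriteAllText(path, text);
    }
}
```

In the PDF example, the `recognizePage` delegate would wrap the select, render, `SetImage`, `RunOCR`, and `GetOCRResultText` calls for one page. Unlike the original loop, this sketch adds a line break after each page, which may or may not be what you want in the output file.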
Reading text from an image
This section explains how to read text from simple, single-page image files. For more information on reading multipage image files, see Reading Text from Multipage TIFF Files.
To read text from an image file, follow the steps below:
- Create a `GdPictureImaging` object and a `GdPictureOCR` object.
- Select the image by passing its path to the `CreateGdPictureImageFromFile` method of the `GdPictureImaging` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the image with the `SetImage` method.
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the `OCRMode` property.
  - Optional: Set the character allowlist with the `CharacterSet` property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.
  - Optional: Set the character denylist with the `CharacterBlackList` property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the result of the OCR process as text with the `GetOCRResultText` method of the `GdPictureOCR` object.
- Save the output in a new text file with the standard `System.IO.StreamWriter` class.
- Release unnecessary resources.
The example below reads text from an image file and saves the output in a TXT file:
C#

    using GdPicture14;

    using GdPictureImaging gdpictureImaging = new GdPictureImaging();
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    // Select the image to read.
    int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
    // Configure the OCR parameters.
    gdpictureOCR.SetImage(imageId);
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Run the OCR process.
    string resultId = gdpictureOCR.RunOCR();
    // Get the result of the OCR process as text.
    string outputText = gdpictureOCR.GetOCRResultText(resultId);
    // Save the output in a new text file.
    System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
    outputFile.WriteLine(outputText);
    outputFile.Close();
    // Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId);

VB.NET

    Imports GdPicture14

    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
        Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
            ' Select the image to read.
            Dim imageId As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
            ' Configure the OCR parameters.
            gdpictureOCR.SetImage(imageId)
            gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
            gdpictureOCR.AddLanguage(OCRLanguage.English)
            ' Run the OCR process.
            Dim resultId As String = gdpictureOCR.RunOCR()
            ' Get the result of the OCR process as text.
            Dim outputText As String = gdpictureOCR.GetOCRResultText(resultId)
            ' Save the output in a new text file.
            Dim outputFile As System.IO.StreamWriter = New System.IO.StreamWriter("C:\temp\output.txt")
            outputFile.WriteLine(outputText)
            outputFile.Close()
            ' Release unnecessary resources.
            gdpictureImaging.ReleaseGdPictureImage(imageId)
        End Using
    End Using
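The example above leaves out the optional settings from the steps. The fragment below is a minimal sketch of how they might be set, assuming the SDK is referenced through the `GdPicture14` namespace. The enumeration member `OCRMode.FavorAccuracy` and the sample character strings are assumptions made for illustration. Check the API reference for the exact member names and the accepted values before relying on them.

```csharp
using GdPicture14;

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);

// Optional: prefer recognition accuracy over speed.
// Assumption: the member is named `FavorAccuracy`.
gdpictureOCR.OCRMode = OCRMode.FavorAccuracy;

// Optional: allowlist. With this sample value, the engine only recognizes
// digits and a few separators, which can suit invoices or meter readings.
gdpictureOCR.CharacterSet = "0123456789.,-";

// Optional alternative: a denylist instead of an allowlist.
// With this sample value, the engine doesn't recognize these three characters.
// gdpictureOCR.CharacterBlackList = "|~^";
```

Setting an allowlist and a denylist together is rarely useful, so the sketch keeps the denylist commented out as an alternative.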
Reading text from multipage TIFF files
To read text from a multipage TIFF file, follow the steps below:
- Create a `GdPictureImaging` object and a `GdPictureOCR` object.
- Select the image by passing its path to the `TiffCreateMultiPageFromFile` method of the `GdPictureImaging` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the `OCRMode` property.
  - Optional: Set the character allowlist with the `CharacterSet` property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.
  - Optional: Set the character denylist with the `CharacterBlackList` property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.
- Create an empty string where you’ll save the output.
- Determine the number of pages with the `GetPageCount` method of the `GdPictureImaging` object and loop through them.
- Select a page with the `TiffSelectPage` method of the `GdPictureImaging` object.
- Pass the page to the `GdPictureOCR` object with the `SetImage` method of the `GdPictureOCR` object.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the result of the OCR process as text with the `GetOCRResultText` method of the `GdPictureOCR` object, and save it in the output string.
- Release the OCR result with the `ReleaseOCRResult` method of the `GdPictureOCR` object.
- After reading all the pages, save the output in a new text file with the standard `System.IO.StreamWriter` class.
- Release unnecessary resources.
The example below reads text from a multipage TIFF file and saves the output in a TXT file:
C#

    using GdPicture14;

    using GdPictureImaging gdpictureImaging = new GdPictureImaging();
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    // Select the image to read.
    int imageId = gdpictureImaging.TiffCreateMultiPageFromFile(@"C:\temp\source.tif");
    // Configure the OCR parameters.
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Create an empty string where you'll save the output.
    string outputText = "";
    // Determine the number of pages and loop through them.
    int pageCount = gdpictureImaging.GetPageCount(imageId);
    for (int page = 1; page <= pageCount; page++)
    {
        // Select a page and pass it to the `GdPictureOCR` object.
        gdpictureImaging.TiffSelectPage(imageId, page);
        gdpictureOCR.SetImage(imageId);
        // Run the OCR process.
        string resultId = gdpictureOCR.RunOCR();
        // Get the result of the OCR process as text.
        outputText += gdpictureOCR.GetOCRResultText(resultId);
        // Release the OCR result.
        gdpictureOCR.ReleaseOCRResult(resultId);
    }
    // Save the output in a new text file.
    System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
    outputFile.WriteLine(outputText);
    outputFile.Close();
    // Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId);

VB.NET

    Imports GdPicture14

    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
        Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
            ' Select the image to read.
            Dim imageId As Integer = gdpictureImaging.TiffCreateMultiPageFromFile("C:\temp\source.tif")
            ' Configure the OCR parameters.
            gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
            gdpictureOCR.AddLanguage(OCRLanguage.English)
            ' Create an empty string where you'll save the output.
            Dim outputText = ""
            ' Determine the number of pages and loop through them.
            Dim pageCount As Integer = gdpictureImaging.GetPageCount(imageId)
            For page = 1 To pageCount
                ' Select a page and pass it to the `GdPictureOCR` object.
                gdpictureImaging.TiffSelectPage(imageId, page)
                gdpictureOCR.SetImage(imageId)
                ' Run the OCR process.
                Dim resultId As String = gdpictureOCR.RunOCR()
                ' Get the result of the OCR process as text.
                outputText += gdpictureOCR.GetOCRResultText(resultId)
                ' Release the OCR result.
                gdpictureOCR.ReleaseOCRResult(resultId)
            Next
            ' Save the output in a new text file.
            Dim outputFile As System.IO.StreamWriter = New System.IO.StreamWriter("C:\temp\output.txt")
            outputFile.WriteLine(outputText)
            outputFile.Close()
            ' Release unnecessary resources.
            gdpictureImaging.ReleaseGdPictureImage(imageId)
        End Using
    End Using
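The examples in this guide don't check whether loading or recognition succeeded. The sketch below shows one way this might be done for the currently selected page of a TIFF file. It assumes the `GetStat` method of the `GdPictureImaging` and `GdPictureOCR` objects returns `GdPictureStatus.OK` after a successful call, as described in the SDK's API reference. The console messages are illustrative only.

```csharp
using System;
using GdPicture14;

using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();

// Load the TIFF file and stop early if loading failed.
int imageId = gdpictureImaging.TiffCreateMultiPageFromFile(@"C:\temp\source.tif");
if (gdpictureImaging.GetStat() != GdPictureStatus.OK)
{
    Console.WriteLine($"Could not load the file: {gdpictureImaging.GetStat()}");
    return;
}

try
{
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    gdpictureOCR.SetImage(imageId);

    // Run OCR on the currently selected page and check the status
    // before reading the result.
    string resultId = gdpictureOCR.RunOCR();
    if (gdpictureOCR.GetStat() == GdPictureStatus.OK)
    {
        Console.WriteLine(gdpictureOCR.GetOCRResultText(resultId));
        gdpictureOCR.ReleaseOCRResult(resultId);
    }
    else
    {
        Console.WriteLine($"OCR failed: {gdpictureOCR.GetStat()}");
    }
}
finally
{
    // Release the image even if an exception is thrown.
    gdpictureImaging.ReleaseGdPictureImage(imageId);
}
```

The same pattern could be applied inside the page loop, for example by skipping a page whose status isn't `GdPictureStatus.OK` instead of appending its text to the output.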