Read Text from PDFs and Images Using C#

This guide explains how to read text from a PDF or image file. Text in a PDF or an image is sometimes stored in a way that prevents you from searching or copying it. GdPicture.NET’s optical character recognition (OCR) engine recognizes this text and lets you save it in a separate file, where you can search, copy, and paste it.

Reading Text from a PDF

To read text from a PDF, follow these steps:

  1. Create a GdPicturePDF object and a GdPictureOCR object.

  2. Select the source document by passing its path to the LoadFromFile method of the GdPicturePDF object.

  3. Configure the OCR process with the GdPictureOCR object in the following way:

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    • With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

    • Optional: Set whether OCR prioritizes recognition accuracy or speed with the OCRMode property.

    • Optional: Set the character allowlist with the CharacterSet property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.

    • Optional: Set the character denylist with the CharacterBlackList property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.

  4. Create an empty string where you’ll save the output.

  5. Determine the number of pages with the GetPageCount method of the GdPicturePDF object and loop through them.

  6. Select each page with the SelectPage method of the GdPicturePDF object, and render it to a 300 dots-per-inch (DPI) image with the RenderPageToGdPictureImageEx method.

  7. Pass the image to the GdPictureOCR object with the SetImage method of the GdPictureOCR object.

  8. Run the OCR process with the RunOCR method of the GdPictureOCR object.

  9. Get the result of the OCR process as text with the GetOCRResultText method of the GdPictureOCR object, and save it in the output string.

  10. Release the image with the DisposeImage method of the GdPictureDocumentUtilities class, and release the OCR result with the ReleaseOCRResult method of the GdPictureOCR object.

  11. After reading all the pages, save the output in a new text file with the standard System.IO.StreamWriter class.

  12. Release resources that are no longer needed by closing the document with the CloseDocument method of the GdPicturePDF object.

The example below reads text from a PDF and saves the output in a TXT file:

using GdPicturePDF gdpicturePDF = new GdPicturePDF();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
// Select the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Create an empty string where you'll save the output.
string outputText = "";
// Determine the number of pages and loop through them.
int pageCount = gdpicturePDF.GetPageCount();
for (int page = 1; page <= pageCount; page++)
{
    gdpicturePDF.SelectPage(page);
    // Render the page to a 300 DPI image.
    int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
    // Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId);
    // Run the OCR process.
    string resultId = gdpictureOCR.RunOCR();
    // Get the result of the OCR process as text.
    outputText += gdpictureOCR.GetOCRResultText(resultId);
    // Release the image and the OCR result.
    GdPictureDocumentUtilities.DisposeImage(imageId);
    gdpictureOCR.ReleaseOCRResult(resultId);
}
// Save the output in a new text file.
System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
outputFile.WriteLine(outputText);
outputFile.Close();
// Release unnecessary resources.
gdpicturePDF.CloseDocument();

The same example in VB.NET:

Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
    ' Select the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Configure the OCR process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    ' Create an empty string where you'll save the output.
    Dim outputText = ""
    ' Determine the number of pages and loop through them.
    Dim pageCount As Integer = gdpicturePDF.GetPageCount()
    For page = 1 To pageCount
        gdpicturePDF.SelectPage(page)
        ' Render the page to a 300 DPI image.
        Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
        ' Pass the image to the `GdPictureOCR` object.
        gdpictureOCR.SetImage(imageId)
        ' Run the OCR process.
        Dim resultId As String = gdpictureOCR.RunOCR()
        ' Get the result of the OCR process as text.
        outputText += gdpictureOCR.GetOCRResultText(resultId)
        ' Release the image and the OCR result.
        GdPictureDocumentUtilities.DisposeImage(imageId)
        gdpictureOCR.ReleaseOCRResult(resultId)
    Next
    ' Save the output in a new text file.
    Dim outputFile As New System.IO.StreamWriter("C:\temp\output.txt")
    outputFile.WriteLine(outputText)
    outputFile.Close()
    ' Release unnecessary resources.
    gdpicturePDF.CloseDocument()
End Using
End Using
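
The optional settings listed in step 3 can be sketched as follows. This is a minimal sketch rather than part of the example above. It assumes the GdPicture14 namespace, that the OCRMode property accepts an OCRMode.FavorAccuracy enumeration value, and that CharacterSet and CharacterBlackList take plain strings of characters. The allowlist and denylist contents are illustrative choices, so check the API reference for your version before relying on them.

```csharp
using GdPicture14; // Assumed namespace for GdPicture.NET 14.

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
// Required settings, as in the example above.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Optional: prioritize recognition accuracy over speed (assumed enumeration value).
gdpictureOCR.OCRMode = OCRMode.FavorAccuracy;
// Optional: allowlist. Recognize only digits and a few separators (illustrative choice).
gdpictureOCR.CharacterSet = "0123456789.,-";
// Optional alternative: a denylist of characters to skip (illustrative choice).
// gdpictureOCR.CharacterBlackList = "|~";
```

An allowlist is usually the better fit when the expected content is narrow, such as invoice numbers, while a denylist suits documents where only a few symbols cause misreads.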

Reading Text from an Image

This section explains how to read text from simple, single-page image files. For more information on reading multipage image files, see Reading Text from Multipage TIFF Files.

To read text from an image file, follow these steps:

  1. Create a GdPictureImaging object and a GdPictureOCR object.

  2. Select the image by passing its path to the CreateGdPictureImageFromFile method of the GdPictureImaging object.

  3. Configure the OCR process with the GdPictureOCR object in the following way:

    • Set the image with the SetImage method.

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    • With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

    • Optional: Set whether OCR prioritizes recognition accuracy or speed with the OCRMode property.

    • Optional: Set the character allowlist with the CharacterSet property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.

    • Optional: Set the character denylist with the CharacterBlackList property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.

  4. Run the OCR process with the RunOCR method of the GdPictureOCR object.

  5. Get the result of the OCR process as text with the GetOCRResultText method of the GdPictureOCR object.

  6. Save the output in a new text file with the standard System.IO.StreamWriter class.

  7. Release the image with the ReleaseGdPictureImage method of the GdPictureImaging object.

The example below reads text from an image file and saves the output in a TXT file:

using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
// Select the image to read.
int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
// Configure the OCR parameters.
gdpictureOCR.SetImage(imageId);
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the OCR process.
string resultId = gdpictureOCR.RunOCR();
// Get the result of the OCR process as text.
string outputText = gdpictureOCR.GetOCRResultText(resultId);
// Save the output in a new text file.
System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
outputFile.WriteLine(outputText);
outputFile.Close();
// Release unnecessary resources.
gdpictureImaging.ReleaseGdPictureImage(imageId);

The same example in VB.NET:

Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
    ' Select the image to read.
    Dim imageId As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
    ' Configure the OCR parameters.
    gdpictureOCR.SetImage(imageId)
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    ' Run the OCR process.
    Dim resultId As String = gdpictureOCR.RunOCR()
    ' Get the result of the OCR process as text.
    Dim outputText As String = gdpictureOCR.GetOCRResultText(resultId)
    ' Save the output in a new text file.
    Dim outputFile As New System.IO.StreamWriter("C:\temp\output.txt")
    outputFile.WriteLine(outputText)
    outputFile.Close()
    ' Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId)
End Using
End Using
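
The examples in this guide don't check whether each call succeeds. The sketch below shows one way to add such checks to the image workflow. It assumes the GdPicture14 namespace, that CreateGdPictureImageFromFile returns 0 when the image can't be loaded, and that the GetStat method of each object reports the status of its most recent operation as a GdPictureStatus value. Treat these as assumptions to verify against the API reference.

```csharp
using System;
using GdPicture14; // Assumed namespace for GdPicture.NET 14.

using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
// Assumption: an image ID of 0 or a status other than OK means loading failed.
GdPictureStatus loadStatus = gdpictureImaging.GetStat();
if (imageId == 0 || loadStatus != GdPictureStatus.OK)
{
    Console.WriteLine("Could not load the image: " + loadStatus);
    return;
}
gdpictureOCR.SetImage(imageId);
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
string resultId = gdpictureOCR.RunOCR();
// Assumption: GetStat reflects the outcome of the RunOCR call above.
GdPictureStatus ocrStatus = gdpictureOCR.GetStat();
if (ocrStatus == GdPictureStatus.OK)
{
    Console.WriteLine(gdpictureOCR.GetOCRResultText(resultId));
}
else
{
    Console.WriteLine("OCR failed: " + ocrStatus);
}
// Release the image whether or not OCR succeeded.
gdpictureImaging.ReleaseGdPictureImage(imageId);
```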

Reading Text from Multipage TIFF Files

To read text from a multipage TIFF file, follow these steps:

  1. Create a GdPictureImaging object and a GdPictureOCR object.

  2. Select the image by passing its path to the TiffCreateMultiPageFromFile method of the GdPictureImaging object.

  3. Configure the OCR process with the GdPictureOCR object in the following way:

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    • With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

    • Optional: Set whether OCR prioritizes recognition accuracy or speed with the OCRMode property.

    • Optional: Set the character allowlist with the CharacterSet property. When scanning the image, the OCR engine only recognizes the characters included in the allowlist.

    • Optional: Set the character denylist with the CharacterBlackList property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.

  4. Create an empty string where you’ll save the output.

  5. Determine the number of pages with the GetPageCount method of the GdPictureImaging object and loop through them.

  6. Select a page with the TiffSelectPage method of the GdPictureImaging object.

  7. Pass the page to the GdPictureOCR object with the SetImage method of the GdPictureOCR object.

  8. Run the OCR process with the RunOCR method of the GdPictureOCR object.

  9. Get the result of the OCR process as text with the GetOCRResultText method of the GdPictureOCR object, and save it in the output string.

  10. Release the OCR result with the ReleaseOCRResult method of the GdPictureOCR object.

  11. After reading all the pages, save the output in a new text file with the standard System.IO.StreamWriter class.

  12. Release the image with the ReleaseGdPictureImage method of the GdPictureImaging object.

The example below reads text from a multipage TIFF file and saves the output in a TXT file:

using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
// Select the image to read.
int imageId = gdpictureImaging.TiffCreateMultiPageFromFile(@"C:\temp\source.tif");
// Configure the OCR parameters.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Create an empty string where you'll save the output.
string outputText = "";
// Determine the number of pages and loop through them.
int pageCount = gdpictureImaging.GetPageCount(imageId);
for (int page = 1; page <= pageCount; page++)
{
    // Select a page and pass it to the `GdPictureOCR` object.
    gdpictureImaging.TiffSelectPage(imageId, page);
    gdpictureOCR.SetImage(imageId);
    // Run the OCR process.
    string resultId = gdpictureOCR.RunOCR();
    // Get the result of the OCR process as text.
    outputText += gdpictureOCR.GetOCRResultText(resultId);
    // Release the OCR result.
    gdpictureOCR.ReleaseOCRResult(resultId);
}
// Save the output in a new text file.
System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
outputFile.WriteLine(outputText);
outputFile.Close();
// Release unnecessary resources.
gdpictureImaging.ReleaseGdPictureImage(imageId);

The same example in VB.NET:

Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
    ' Select the image to read.
    Dim imageId As Integer = gdpictureImaging.TiffCreateMultiPageFromFile("C:\temp\source.tif")
    ' Configure the OCR parameters.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    ' Create an empty string where you'll save the output.
    Dim outputText = ""
    ' Determine the number of pages and loop through them.
    Dim pageCount As Integer = gdpictureImaging.GetPageCount(imageId)
    For page = 1 To pageCount
        ' Select a page and pass it to the `GdPictureOCR` object.
        gdpictureImaging.TiffSelectPage(imageId, page)
        gdpictureOCR.SetImage(imageId)
        ' Run the OCR process.
        Dim resultId As String = gdpictureOCR.RunOCR()
        ' Get the result of the OCR process as text.
        outputText += gdpictureOCR.GetOCRResultText(resultId)
        ' Release the OCR result.
        gdpictureOCR.ReleaseOCRResult(resultId)
    Next
    ' Save the output in a new text file.
    Dim outputFile As New System.IO.StreamWriter("C:\temp\output.txt")
    outputFile.WriteLine(outputText)
    outputFile.Close()
    ' Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId)
End Using
End Using
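
For documents with many pages, appending to a string inside the loop copies the accumulated text on every iteration. The sketch below shows one possible alternative that uses only the standard System.Text.StringBuilder and System.IO.File classes. The OcrOutput class, its CollectPages and Save methods, and the readPage delegate are hypothetical names introduced for illustration. The delegate stands in for the per-page SetImage, RunOCR, and GetOCRResultText calls. Unlike the examples above, this sketch ends the text of each page with a line break.

```csharp
using System;
using System.IO;
using System.Text;

public static class OcrOutput
{
    // Hypothetical helper: `readPage` stands in for the per-page OCR calls
    // and returns the recognized text for a 1-based page number.
    public static string CollectPages(int pageCount, Func<int, string> readPage)
    {
        var builder = new StringBuilder();
        for (int page = 1; page <= pageCount; page++)
        {
            // AppendLine adds the page text followed by a line break.
            builder.AppendLine(readPage(page));
        }
        return builder.ToString();
    }

    // File.WriteAllText creates or overwrites the file, writes the text, and closes the file.
    public static void Save(string path, string text) => File.WriteAllText(path, text);
}
```

In the multipage TIFF example, the body of the loop would move into the delegate passed to CollectPages, and Save would replace the StreamWriter calls.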