GdPicture.NET.14
GdPicture14 Namespace / GdPicturePDF Class / OcrPages Method / OcrPages(String,Int32,String,String,String,Single) Method
OcrPages(String,Int32,String,String,String,Single) Method
Runs optical character recognition (OCR) on the specified page range of the loaded PDF document using the specified number of threads, dictionary, character white list and resolution. The recognized text is added as invisible text on each processed page. The orientation of each page is detected automatically.

This method rasterizes each processed page before the OCR process starts, so any existing visible text becomes part of the page image. Existing invisible text is not kept either: the rasterization removes it from the processed pages.

This method runs asynchronously, so you have to wait for the OCR process to finish before manipulating the document further. You can use OCR-related events such as BeforePageOcr, OcrPagesProgress and OcrPagesDone to follow the process.
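
One way to wait for the OCR process to end is to wrap the OcrPagesDone event in a task. The following is a minimal sketch, not a pattern prescribed by the library: the names OcrPagesAwaiter and OcrAllPagesAsync are hypothetical, the handler signature is taken from the example in this topic, and the sketch assumes that OcrPagesDone is raised once for every call that returns GdPictureStatus.OK.

```csharp
using System.Threading.Tasks;
using GdPicture14;

// Hypothetical helper, not part of GdPicture.
static class OcrPagesAwaiter
{
    // Starts OCR on all pages and completes when OcrPagesDone is raised.
    public static async Task<GdPictureStatus> OcrAllPagesAsync(GdPicturePDF pdf, string dictionaryPath)
    {
        var done = new TaskCompletionSource<GdPictureStatus>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        // Same signature as the OcrPagesDone handler in the example of this topic.
        void OnDone(GdPictureStatus status) => done.TrySetResult(status);

        pdf.OcrPagesDone += OnDone;
        try
        {
            GdPictureStatus started = pdf.OcrPages("*", 0, "eng", dictionaryPath, "", 300);
            if (started != GdPictureStatus.OK)
                return started; // The call failed, so no completion event is awaited.

            // Assumption: the event fires once the whole page range is processed.
            return await done.Task;
        }
        finally
        {
            pdf.OcrPagesDone -= OnDone;
        }
    }
}
```

The caller can then save or dispose of the document only after the returned task has completed.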

Syntax
'Declaration
Public Overloads Function OcrPages( _
   ByVal PageRange As String, _
   ByVal ThreadCount As Integer, _
   ByVal Dictionary As String, _
   ByVal DictionaryPath As String, _
   ByVal CharWhiteList As String, _
   ByVal DPI As Single _
) As GdPictureStatus

public GdPictureStatus OcrPages( 
   string PageRange,
   int ThreadCount,
   string Dictionary,
   string DictionaryPath,
   string CharWhiteList,
   float DPI
)

public function OcrPages( 
    PageRange: String;
    ThreadCount: Integer;
    Dictionary: String;
    DictionaryPath: String;
    CharWhiteList: String;
    DPI: Single
): GdPictureStatus; 

public function OcrPages( 
   PageRange : String,
   ThreadCount : int,
   Dictionary : String,
   DictionaryPath : String,
   CharWhiteList : String,
   DPI : float
) : GdPictureStatus;

public: GdPictureStatus OcrPages( 
   string* PageRange,
   int ThreadCount,
   string* Dictionary,
   string* DictionaryPath,
   string* CharWhiteList,
   float DPI
) 

public:
GdPictureStatus OcrPages( 
   String^ PageRange,
   int ThreadCount,
   String^ Dictionary,
   String^ DictionaryPath,
   String^ CharWhiteList,
   float DPI
) 

Parameters

PageRange
The page range to be processed, for example, "1;4;5" to process pages 1, 4 and 5 or "1-5;10" to process pages from 1 to 5 and page 10. Set this parameter to "*" to process all pages of the current document.
ThreadCount
The number of threads to use for asynchronous processing. Set this parameter to 0 to let the engine maximize performance automatically.
Dictionary
The prefix of the dictionary file to use, for example, "spa" for Spanish, "eng" for English, "fra" for French, etc.

The name of each dictionary file has the predefined format [LANGUAGE].traineddata, where [LANGUAGE] identifies the language. You can usually find these files within your standard installation in the directory @\GdPicture.Net 14\Redist\OCR. Additional language dictionary files are available for download.

You can also combine multiple dictionaries with the "+" separator, for instance "eng+fra" for English with French.

DictionaryPath
The path to the folder containing the installed dictionary files that the OCR engine will use. This folder is usually within your standard installation, for example @\GdPicture.Net 14\Redist\OCR, but you can specify your own path as well.
CharWhiteList
The so-called white list of characters, that is, the set of characters to which recognition is restricted. The engine returns only the specified characters when processing. For example, to recognize only numeric characters, set this parameter to "0123456789". To recognize only uppercase letters, set it to "ABCDEFGHIJKLMNOPQRSTUVWXYZ". Set this parameter to an empty string to recognize all characters.
DPI
The resolution, in DPI, that the OCR engine will use. The recommended default is 300.

A value between 200 and 300 should give optimal results on A4-sized documents. Values over 300 generally cause excessive memory usage.
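
To see why higher resolutions are costly, the size of one rasterized page can be estimated from its dimensions. The sketch below is plain arithmetic, not a GdPicture API: the RasterMemoryEstimate class is hypothetical, and it assumes an uncompressed raster with 3 bytes per pixel (24-bit color), which may differ from what the OCR engine actually allocates.

```csharp
using System;

// Hypothetical helper, not part of GdPicture.
static class RasterMemoryEstimate
{
    // Estimates the bytes needed for one uncompressed page raster.
    // bytesPerPixel = 3 is an assumption (24-bit color).
    public static long BytesPerPage(double widthInches, double heightInches,
                                    float dpi, int bytesPerPixel = 3)
    {
        long widthPx = (long)Math.Round(widthInches * dpi);
        long heightPx = (long)Math.Round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }
}
```

Under these assumptions, an A4 page (about 8.27 x 11.69 inches) needs roughly 26 MB at 300 DPI and four times as much at 600 DPI, because the pixel count grows with the square of the resolution.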

Return Value

A member of the GdPictureStatus enumeration. If the method completes successfully, the return value is GdPictureStatus.OK.

We strongly recommend always checking this status first.

Remarks
This method can only be used with non-encrypted documents. Also be aware that this method runs asynchronously.

This method uses the GdPicture OCR engine.

This method requires the OCR component to run.
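
The PageRange and Dictionary formats described under Parameters can be checked before the call, which helps to tell a malformed argument from an OCR failure. The following is a minimal sketch under stated assumptions: OcrArgumentChecks and its methods are hypothetical helpers, not part of GdPicture, and the parsing follows only the formats documented in this topic ("1-5;10", "*", "eng+fra", [LANGUAGE].traineddata). The engine itself may accept or reject other inputs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helpers, not part of GdPicture.
static class OcrArgumentChecks
{
    // Expands a page range such as "1-5;10" into 1,2,3,4,5,10.
    // "*" expands to every page from 1 to pageCount.
    public static List<int> ExpandPageRange(string pageRange, int pageCount)
    {
        if (pageRange.Trim() == "*")
            return Enumerable.Range(1, pageCount).ToList();

        var pages = new List<int>();
        foreach (string part in pageRange.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] bounds = part.Split('-');
            if (bounds.Length > 2)
                throw new ArgumentException("Invalid page range part: " + part);

            int first = int.Parse(bounds[0].Trim());
            int last = bounds.Length == 2 ? int.Parse(bounds[1].Trim()) : first;
            if (first < 1 || last < first || last > pageCount)
                throw new ArgumentException("Invalid page range part: " + part);

            for (int page = first; page <= last; page++)
                pages.Add(page);
        }
        return pages;
    }

    // Returns the expected dictionary file names, one per language prefix
    // in a "+"-separated list, that do not exist in dictionaryPath.
    public static List<string> MissingDictionaryFiles(string dictionary, string dictionaryPath)
    {
        return dictionary.Split('+')
            .Select(language => language.Trim() + ".traineddata")
            .Where(file => !File.Exists(Path.Combine(dictionaryPath, file)))
            .ToList();
    }
}
```

If MissingDictionaryFiles returns a non-empty list, the listed files can be reported to the user before OcrPages is called at all.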

Example
How to convert a TIFF image file (one page or multipage) to a searchable PDF document using multithreading.
Dim gdpicturePDF As New GdPicturePDF()
'Adding the OcrPagesDone event.
AddHandler gdpicturePDF.OcrPagesDone, AddressOf OcrPagesDone
            
Sub OcrPagesDone(status As GdPictureStatus)
    'Saving the resulting document when the OCR process is finished.
    If gdpicturePDF.SaveToFile("output.pdf") = GdPictureStatus.OK Then
        MessageBox.Show("The resulting document is saved.", "OcrPages")
    Else
        MessageBox.Show("The resulting document can't be saved. Status: " + gdpicturePDF.GetStat().ToString(), "OcrPages")
    End If
End Sub
            
Dim caption As String = "OcrPages"
Using oGdPictureImaging As New GdPictureImaging()
    Dim imageId As Integer = oGdPictureImaging.CreateGdPictureImageFromFile("image.tif")
    If oGdPictureImaging.GetStat() = GdPictureStatus.OK Then
        If gdpicturePDF.NewPDF() = GdPictureStatus.OK Then
            If oGdPictureImaging.TiffIsMultiPage(imageId) = False Then
                gdpicturePDF.AddImageFromGdPictureImage(imageId, False, True)
            Else
                Dim NumberOfPages As Integer = oGdPictureImaging.TiffGetPageCount(imageId)
                For i As Integer = 1 To NumberOfPages
                    If oGdPictureImaging.TiffSelectPage(imageId, i) = GdPictureStatus.OK Then
                        gdpicturePDF.AddImageFromGdPictureImage(imageId, False, True)
                        If gdpicturePDF.GetStat() <> GdPictureStatus.OK Then
                            Exit For
                        End If
                    Else
                        Exit For
                    End If
                Next
            End If
            If gdpicturePDF.GetStat() = GdPictureStatus.OK Then
                If gdpicturePDF.OcrPages("*", 0, "eng", "C:\GdPicture.NET 14\Redist\OCR", "", 300) = GdPictureStatus.OK Then
                    MessageBox.Show("OcrPages - Done!", caption)
                Else
                    MessageBox.Show("The OCR process has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption)
                End If
            Else
                MessageBox.Show("The process of adding images has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption)
            End If
        Else
            MessageBox.Show("The new document can't be created. Status: " + gdpicturePDF.GetStat().ToString(), caption)
        End If
        oGdPictureImaging.ReleaseGdPictureImage(imageId)
    Else
        MessageBox.Show("The image file can't be loaded. Status: " + oGdPictureImaging.GetStat().ToString(), caption)
    End If
End Using
'Release resources only after all processes have finished.
gdpicturePDF.Dispose()

GdPicturePDF gdpicturePDF = new GdPicturePDF();
//Adding the OcrPagesDone event.
gdpicturePDF.OcrPagesDone += OcrPagesDone;
            
void OcrPagesDone(GdPictureStatus status)
{
    //Saving the resulting document when the OCR process is finished.
    if (gdpicturePDF.SaveToFile("output.pdf") == GdPictureStatus.OK)
        MessageBox.Show("The resulting document is saved.", "OcrPages");
    else
        MessageBox.Show("The resulting document can't be saved. Status: " + gdpicturePDF.GetStat().ToString(), "OcrPages");
}
            
string caption = "OcrPages";
using (GdPictureImaging oGdPictureImaging = new GdPictureImaging())
{
    int imageId = oGdPictureImaging.CreateGdPictureImageFromFile("image.tif");
    if (oGdPictureImaging.GetStat() == GdPictureStatus.OK)
    {
        if (gdpicturePDF.NewPDF() == GdPictureStatus.OK)
        {
            if (oGdPictureImaging.TiffIsMultiPage(imageId) == false)
            {
                gdpicturePDF.AddImageFromGdPictureImage(imageId, false, true);
            }
            else
            {
                int NumberOfPages = oGdPictureImaging.TiffGetPageCount(imageId);
                for (int i = 1; i <= NumberOfPages; i++)
                {
                    if (oGdPictureImaging.TiffSelectPage(imageId, i) == GdPictureStatus.OK)
                    {
                        gdpicturePDF.AddImageFromGdPictureImage(imageId, false, true);
                        if (gdpicturePDF.GetStat() != GdPictureStatus.OK)
                            break;
                    }
                    else
                        break;
                }
            }
            if (gdpicturePDF.GetStat() == GdPictureStatus.OK)
            {
                if (gdpicturePDF.OcrPages("*", 0, "eng", "C:\\GdPicture.NET 14\\Redist\\OCR", "", 300) == GdPictureStatus.OK)
                {
                    MessageBox.Show("OcrPages - Done!", caption);
                }
                else
                    MessageBox.Show("The OCR process has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption);
            }
            else
                MessageBox.Show("The process of adding images has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption);
        }
        else
        {
            MessageBox.Show("The new document can't be created. Status: " + gdpicturePDF.GetStat().ToString(), caption);
        }
        oGdPictureImaging.ReleaseGdPictureImage(imageId);
    }
    else
        MessageBox.Show("The image file can't be loaded. Status: " + oGdPictureImaging.GetStat().ToString(), caption);
}
//Release resources only after all processes have finished.
gdpicturePDF.Dispose();