Extract Data from Invoices Using C#

GdPicture.NET’s key-value pair (KVP) extraction engine enables you to recognize related data items in a document and export them to an external destination like a spreadsheet.

To extract data items from an invoice, follow these steps:

  1. Create a GdPictureOCR object and a GdPictureImaging object.

  2. Select the invoice by passing its path to the CreateGdPictureImageFromFile method of the GdPictureImaging object.

  3. Configure the OCR process with the GdPictureOCR object in the following way:

    • Set the invoice image as the OCR input by passing its image ID to the SetImage method.

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.

    • With the AddLanguage method, add the language resources that GdPicture.NET uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

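  4. Run the OCR process with the RunOCR method of the GdPictureOCR object. This method returns the ID of the OCR result, which the methods in the following steps take as a parameter.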

  5. Get the number of key-value pairs detected during the OCR process with the GetKeyValuePairCount method of the GdPictureOCR object, and loop through them.

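  6. Get the key-value pairs, the data types, and the confidence scores with the following methods of the GdPictureOCR object:

    • GetKeyValuePairKeyString returns the key of a pair.

    • GetKeyValuePairValueString returns the value of a pair.

    • GetKeyValuePairDataType returns the data type detected for a pair (labeled Document Type in the output below).

    • GetKeyValuePairConfidence returns the confidence score of a pair as a percentage.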

  7. Write the output to the console.

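  8. Release the resources that are no longer needed: the image with the ReleaseGdPictureImage method and the OCR results with the ReleaseOCRResults method.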

The example below retrieves key-value pairs from the following invoice.

Sample invoice

Download the sample invoice and run the code below, or check out our demo.

C#

using System;
using GdPicture14; // GdPicture.NET 14 namespace.

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPictureImaging gdpictureImaging = new GdPictureImaging();
// Load the source document.
int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
gdpictureOCR.SetImage(imageId);
// Run the OCR process.
string ocrResultId = gdpictureOCR.RunOCR();
// Collect every detected key-value pair with its data type and confidence score.
int pairCount = gdpictureOCR.GetKeyValuePairCount(ocrResultId);
string keyValuePairsData = "";
for (int pairIndex = 0; pairIndex < pairCount; pairIndex++)
{
    keyValuePairsData += $"| Key: {gdpictureOCR.GetKeyValuePairKeyString(ocrResultId, pairIndex)} | " +
                         $"Value: {gdpictureOCR.GetKeyValuePairValueString(ocrResultId, pairIndex)} | " +
                         $"Document Type: {gdpictureOCR.GetKeyValuePairDataType(ocrResultId, pairIndex)} | " +
                         $"Confidence Level: {Math.Round(gdpictureOCR.GetKeyValuePairConfidence(ocrResultId, pairIndex), 1)}% |\n";
}
// Write the output to the console.
Console.WriteLine(keyValuePairsData);
// Release unnecessary resources.
gdpictureImaging.ReleaseGdPictureImage(imageId);
gdpictureOCR.ReleaseOCRResults();

VB.NET

Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
    ' Load the source document.
    Dim imageId As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
    ' Configure the OCR process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    gdpictureOCR.SetImage(imageId)
    ' Run the OCR process.
    Dim ocrResultId As String = gdpictureOCR.RunOCR()
    Dim keyValuePairsData = ""
    For pairIndex As Integer = 0 To gdpictureOCR.GetKeyValuePairCount(ocrResultId) - 1
        keyValuePairsData &= $"| Key: {gdpictureOCR.GetKeyValuePairKeyString(ocrResultId, pairIndex)} | " &
                             $"Value: {gdpictureOCR.GetKeyValuePairValueString(ocrResultId, pairIndex)} | " &
                             $"Document Type: {gdpictureOCR.GetKeyValuePairDataType(ocrResultId, pairIndex)} | " &
                             $"Confidence Level: {Math.Round(gdpictureOCR.GetKeyValuePairConfidence(ocrResultId, pairIndex), 1)}% |" & vbLf
    Next
    ' Write the output to the console.
    Console.WriteLine(keyValuePairsData)
    ' Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId)
    gdpictureOCR.ReleaseOCRResults()
End Using
End Using
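
Because the extracted pairs are meant to be exported to a destination such as a spreadsheet, it may help to see one possible export step. The following is a minimal sketch, not part of GdPicture.NET: the KvpRow record, the KvpCsv class, and the column names are hypothetical, and the sketch assumes the values returned by the GetKeyValuePair... methods have already been copied into KvpRow instances. It quotes any field that contains a comma, a quotation mark, or a line break, which matters for values such as 1,220.00€.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical container for one extracted pair; not a GdPicture.NET type.
public record KvpRow(string Key, string Value, string DataType, double Confidence);

public static class KvpCsv
{
    // Quotes a field if it contains a comma, a quotation mark, or a line break,
    // and doubles any embedded quotation marks.
    public static string Escape(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Builds the CSV text: one header line, then one line per pair.
    public static string ToCsv(IEnumerable<KvpRow> rows)
    {
        var lines = new List<string> { "Key,Value,Document Type,Confidence Level" };
        foreach (KvpRow row in rows)
        {
            // Format the score with at most one decimal, independent of the OS culture.
            string confidence = row.Confidence.ToString("0.#", CultureInfo.InvariantCulture);
            string[] fields = { row.Key, row.Value, row.DataType, confidence };
            lines.Add(string.Join(",", fields.Select(f => Escape(f))));
        }
        return string.Join("\r\n", lines) + "\r\n";
    }

    public static void Main()
    {
        // Sample values taken from the invoice output in this guide.
        var rows = new List<KvpRow>
        {
            new("Subtotal", "1,220.00€", "Currency", 100),
            new("Ref. number", "34751", "Number", 92.9),
        };
        Console.Write(ToCsv(rows));
    }
}
```

To save the result instead of printing it, the returned string could be passed to System.IO.File.WriteAllText with a file path of your choice.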

Format the output to obtain the following table:

| Key | Value | Document Type | Confidence Level |
| Billing date | 20/09/2022 | DateTime | 100% |
| Order date | 20/09/2022 | DateTime | 100% |
| Republic of PDF | +100 847 738 227 | PhoneNumber | 77.2% |
| IBAN | AT13 2060 4236 6111 5994 | IBAN | 100% |
| Customer | Vandelay Industries Around the Corner 13 NBC City | String | 69.8% |
| Delivery address | Vandelay Industries Around the Corner 13 NBC City | String | 69.9% |
| Invoice number | No 00162 | String | 70.9% |
| Ref. number | 34751 | Number | 92.9% |
| No | 00162 | Number | 100% |
| Reference | P00201 | UID | 100% |
| Quantity Total (excl. VAT) | 320.00€ | Currency | 59% |
| Subtotal | 1,220.00€ | Currency | 100% |
| Discount (10%) | -122.00€ | Currency | 90.6% |
| VAT (5.5%) | +6710€ | Currency | 66.9% |
| Shipping cost | 0.00€ | Currency | 75% |
| TOTAL | 1,165.10€ | Currency | 100% |
| Description | Lake Mirror | String | 99.6% |
| VAT | 5.5% | Percentage | 66.6% |
| Price per unit (excl. VAT) | 320.00€ | Currency | 80% |
| Tax No. | AT98765321 | UID | 73.8% |
| # | [email protected] | EmailAddress | 65.6% |
| # | www.bruuuk.com | URL | 65.6% |
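
One way to turn the extracted values into a table like the one above is to pad each column to the width of its longest cell. The sketch below is only an illustration under stated assumptions: the KvpRow record and the KvpTable class are hypothetical helpers, not GdPicture.NET types, and the rows are assumed to have been filled from the GetKeyValuePair... methods beforehand.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical container for one extracted pair; not a GdPicture.NET type.
public record KvpRow(string Key, string Value, string DataType, double Confidence);

public static class KvpTable
{
    // Builds a plain-text table whose columns are padded to a common width.
    public static string Format(IReadOnlyList<KvpRow> rows)
    {
        string[] headers = { "Key", "Value", "Document Type", "Confidence Level" };

        // Convert every row to its four display cells first.
        List<string[]> cells = rows
            .Select(r => new[]
            {
                r.Key,
                r.Value,
                r.DataType,
                Math.Round(r.Confidence, 1).ToString(CultureInfo.InvariantCulture) + "%",
            })
            .ToList();

        // Each column is as wide as its header or its longest cell, whichever is longer.
        int[] widths = headers
            .Select((h, i) => Math.Max(h.Length, cells.Count == 0 ? 0 : cells.Max(c => c[i].Length)))
            .ToArray();

        var table = new StringBuilder();
        AppendRow(table, headers, widths);
        foreach (string[] row in cells)
        {
            AppendRow(table, row, widths);
        }
        return table.ToString();
    }

    // Appends one line, padding each cell and trimming trailing spaces.
    private static void AppendRow(StringBuilder table, string[] cells, int[] widths)
    {
        IEnumerable<string> padded = cells.Select((c, i) => c.PadRight(widths[i]));
        table.Append(string.Join(" | ", padded).TrimEnd()).Append('\n');
    }

    public static void Main()
    {
        // Sample values taken from the table above.
        var rows = new List<KvpRow>
        {
            new("Billing date", "20/09/2022", "DateTime", 100),
            new("Ref. number", "34751", "Number", 92.9),
        };
        Console.Write(Format(rows));
    }
}
```

Building all display cells before measuring them keeps the width calculation and the output in agreement, at the cost of holding every row in memory, which should be acceptable for the number of pairs found on a single invoice.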