Invoice recognition and data extraction
This guide describes the steps required to create a customized invoice processing solution with XtractFlow’s AI-powered technology.
Example use case
Manually extracting data from large volumes of invoices in varying formats is daunting, error-prone, and time-consuming, and it can lead to processing delays and missed opportunities.
This guide demonstrates how the extraction of both structured and unstructured data from different types of invoices can be automated to streamline inventory management processes. By leveraging the XtractFlow SDK, accurate inventory records can be maintained, informed procurement decisions can be made, and supply chain operations can be optimized.
Download the example input document.
Configuring the document template
A document template serves as a comprehensive definition for a specific type of document. XtractFlow comes with a set of predefined templates for different document types, one of which is an invoice.
The invoice template has the following predefined fields, covering the most common fields found in invoices:
- Invoice number
- Date of invoice
- Due date
- Customer name
- Customer address
- Vendor name
- Vendor address
- Total VAT excluded
- Total VAT included
- VAT percentage
- VAT amount
- Currency
For this use case, four additional fields are also required:
- The order ID
- The payment due date
- The number of unique items
- The total quantity of all items combined
To set up the document template for the invoice, begin by obtaining the predefined Invoice template. Then, add the four new fields to it by providing a clear semantic description for each field:
static DocumentTemplate buildInvoiceTemplate()
{
    // Template setup: getting an instance of the preconfigured invoice template.
    DocumentTemplate invoiceTemplate = DocumentTemplates.Invoice;

    // Adding a custom field to the invoice template instance
    // to get the order ID of a specific format.
    invoiceTemplate.AddField(new()
    {
        Name = "Order ID",
        Format = FieldDataFormat.Text,
        SemanticDescription = "The order ID in the invoice",
        RegexValidationMethods = new List<RegexFieldValidationMethod>
        {
            new RegexFieldValidationMethod("^[A-Z][0-9]{1,6}$")
        }
    });

    // Adding a custom field to the invoice template instance
    // to calculate the payment due date based on the information in the invoice.
    invoiceTemplate.AddField(new()
    {
        Name = "Payment due date",
        Format = FieldDataFormat.Text,
        SemanticDescription = "The date that the payment is due",
        StandardValidationMethods = new List<StandardFieldValidationMethod>()
        {
            new StandardFieldValidationMethod(StandardFieldValidation.DateIntegrity)
        }
    });

    // Adding a custom field to the template instance
    // to get the number of unique items.
    invoiceTemplate.AddField(new()
    {
        Name = "Unique item count",
        Format = FieldDataFormat.Number,
        SemanticDescription = "The total number of unique items",
        StandardValidationMethods = new List<StandardFieldValidationMethod>()
        {
            new StandardFieldValidationMethod(StandardFieldValidation.NumberIntegrity)
        }
    });

    // Adding a custom field to the template instance
    // to get the sum total of all items.
    invoiceTemplate.AddField(new()
    {
        Name = "Total item count",
        Format = FieldDataFormat.Number,
        SemanticDescription = "The sum of all items, including multiplying the quantities of each item",
        StandardValidationMethods = new List<StandardFieldValidationMethod>()
        {
            new StandardFieldValidationMethod(StandardFieldValidation.NumberIntegrity)
        }
    });

    return invoiceTemplate;
}
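To see which values the Order ID pattern accepts, the expression can be tried on its own with .NET's Regex class. This is only a minimal sketch: OrderIdPatternDemo and IsValidOrderId are hypothetical names introduced for illustration, and the sketch assumes XtractFlow evaluates the pattern with standard .NET regular expression semantics, which this guide doesn't state.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OrderIdPatternDemo
{
    // Same pattern that is passed to RegexFieldValidationMethod in the template:
    // one uppercase letter followed by one to six digits.
    private static readonly Regex OrderIdPattern = new Regex("^[A-Z][0-9]{1,6}$");

    // Hypothetical helper: returns true when the candidate matches the pattern.
    public static bool IsValidOrderId(string value) => OrderIdPattern.IsMatch(value);

    public static void Main()
    {
        string[] candidates = { "X001525", "x001525", "X1234567", "AB123" };
        foreach (string candidate in candidates)
        {
            Console.WriteLine($"{candidate}: {IsValidOrderId(candidate)}");
        }
    }
}
```

Only the first candidate matches. The others fail because of a lowercase letter, a seventh digit, and a second leading letter, respectively. If your order IDs follow a different scheme, adjust the pattern before adding the field.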
Building the component
Create a ProcessorComponent object, which the processor requires. This object encapsulates the logic of the document processing workflow:
static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        // Classification is not required, as you're using a single template.
        EnableClassifier = false,
        // Enabling extraction of the fields specified in the templates
        // defined in the "Templates" field below.
        EnableFieldsExtraction = true,
        Templates = new DocumentTemplate[]
        {
            buildInvoiceTemplate()
        }
    };
}
Processing a document and analyzing results
Next, instantiate a DocumentProcessor object and call its Process method to start the inference process. The method returns a ProcessorResult object containing the processing outcome:
// Process the document.
ProcessorResult result = new DocumentProcessor().Process(sourceFile, component);

// Analyze results.
if (result.ExtractedFields != null)
{
    foreach (var item in result.ExtractedFields)
    {
        Console.WriteLine($"Field name: '{item.FieldName}' | Field value: '{item.Value}' | Validation state: ({item.ValidationState})");
    }
}
Results output
Field name: 'Invoice number' | Field value: '1000876' | Validation state: (Undefined)
Field name: 'Date emission' | Field value: '14/08/2023' | Validation state: (Valid)
Field name: 'Due date' | Field value: '13/09/2023' | Validation state: (Valid)
Field name: 'Customer name' | Field value: 'Roger COMPANY' | Validation state: (Undefined)
Field name: 'Customer address' | Field value: '100 Mighty Bay, 125863 Rome, IT' | Validation state: (Valid)
Field name: 'Vendor name' | Field value: 'Rabbit STORE' | Validation state: (Undefined)
Field name: 'Vendor address' | Field value: '255 Commercial Street, 25880 New York, US' | Validation state: (Valid)
Field name: 'Total VAT excluded' | Field value: '1750' | Validation state: (Valid)
Field name: 'Total VAT included' | Field value: '1925' | Validation state: (Valid)
Field name: 'VAT percentage' | Field value: '10' | Validation state: (Undefined)
Field name: 'VAT amount' | Field value: '175' | Validation state: (Valid)
Field name: 'Currency' | Field value: 'USD' | Validation state: (Valid)
Field name: 'Order ID' | Field value: 'X001525' | Validation state: (Valid)
Field name: 'Payment due date' | Field value: '2023-09-13' | Validation state: (Valid)
Field name: 'Unique item count' | Field value: '4' | Validation state: (Valid)
Field name: 'Total item count' | Field value: '17' | Validation state: (Valid)
The results show that the extraction process not only retrieved data from the invoice, but also inferred additional details, such as the item counts and the payment due date, by combining pieces of information found within the invoice.
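Because the extracted values come from a language model, it can be worth cross-checking them against each other before they reach downstream systems. The following is a minimal sketch of such a check using only the values printed above. InvoiceSanityCheck, VatIsConsistent, and ParseDate are hypothetical names introduced for illustration and aren't part of the XtractFlow SDK, and the date formats are assumed from the sample output.

```csharp
using System;
using System.Globalization;

public static class InvoiceSanityCheck
{
    // Hypothetical helper: the VAT amount must equal net * rate / 100,
    // and the gross total must equal net + VAT amount.
    public static bool VatIsConsistent(decimal net, decimal vatPercent, decimal vatAmount, decimal gross)
        => net * vatPercent / 100m == vatAmount && net + vatAmount == gross;

    // Hypothetical helper: parses a date with an explicit, culture-independent format.
    public static DateTime ParseDate(string value, string format)
        => DateTime.ParseExact(value, format, CultureInfo.InvariantCulture);

    public static void Main()
    {
        // Values copied from the sample output above.
        bool vatOk = VatIsConsistent(net: 1750m, vatPercent: 10m, vatAmount: 175m, gross: 1925m);

        DateTime issued = ParseDate("14/08/2023", "dd/MM/yyyy");
        DateTime dueDate = ParseDate("13/09/2023", "dd/MM/yyyy");
        DateTime paymentDueDate = ParseDate("2023-09-13", "yyyy-MM-dd");

        Console.WriteLine($"VAT consistent: {vatOk}");
        Console.WriteLine($"Due dates agree: {dueDate == paymentDueDate}");
        Console.WriteLine($"Payment term (days): {(dueDate - issued).Days}");
    }
}
```

For the sample invoice, both checks pass and the payment term works out to 30 days. Exact decimal equality is enough for these round figures. Real invoices with rounded VAT amounts would likely need a small tolerance instead.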
The complete solution
using System;
using System.Collections.Generic;
// Also import the XtractFlow SDK namespaces for your installed version.

static void RunExtraction()
{
    Configuration.RegisterGdPictureKey("GDPICTURE_KEY");
    Configuration.RegisterLLMProvider(new OpenAIProvider("OPENAI_API_KEY"));
    Configuration.ResourcesFolder = "resources";

    // Building the component.
    ProcessorComponent component = buildComponent();

    // Process the document.
    ProcessorResult result = new DocumentProcessor().Process("invoice.pdf", component);

    // Analyze results.
    if (result.ExtractedFields != null)
    {
        foreach (var item in result.ExtractedFields)
        {
            Console.WriteLine($"Field name: '{item.FieldName}' | Field value: '{item.Value}' | Validation state: ({item.ValidationState})");
        }
    }
}

static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        // Classification is not required, as you're using a single template.
        EnableClassifier = false,
        // Enabling extraction of the fields specified in the templates
        // defined in the "Templates" field below.
        EnableFieldsExtraction = true,
        Templates = new DocumentTemplate[]
        {
            buildInvoiceTemplate()
        }
    };
}

static DocumentTemplate buildInvoiceTemplate()
{
    // Template setup: getting an instance of the preconfigured invoice template.
    DocumentTemplate invoiceTemplate = DocumentTemplates.Invoice;

    // Adding a custom field to the invoice template instance
    // to get the order ID of a specific format.
    invoiceTemplate.AddField(new()
    {
        Name = "Order ID",
        Format = FieldDataFormat.Text,
        SemanticDescription = "The order ID in the invoice",
        RegexValidationMethods = new List<RegexFieldValidationMethod>
        {
            new RegexFieldValidationMethod("^[A-Z][0-9]{1,6}$")
        }
    });

    // Adding a custom field to the invoice template instance
    // to calculate the payment due date based on the information in the invoice.
    invoiceTemplate.AddField(new()
    {
        Name = "Payment due date",
        Format = FieldDataFormat.Text,
        SemanticDescription = "The date that the payment is due",
        StandardValidationMethods = new List<StandardFieldValidationMethod>()
        {
            new StandardFieldValidationMethod(StandardFieldValidation.DateIntegrity)
        }
    });

    // Adding a custom field to the template instance
    // to get the number of unique items.
    invoiceTemplate.AddField(new()
    {
        Name = "Unique item count",
        Format = FieldDataFormat.Number,
        SemanticDescription = "The total number of unique items",
        StandardValidationMethods = new List<StandardFieldValidationMethod>()
        {
            new StandardFieldValidationMethod(StandardFieldValidation.NumberIntegrity)
        }
    });

    // Adding a custom field to the template instance
    // to get the sum total of all items.
    invoiceTemplate.AddField(new()
    {
        Name = "Total item count",
        Format = FieldDataFormat.Number,
        SemanticDescription = "The sum of all items, including multiplying the quantities of each item",
        StandardValidationMethods = new List<StandardFieldValidationMethod>()
        {
            new StandardFieldValidationMethod(StandardFieldValidation.NumberIntegrity)
        }
    });

    return invoiceTemplate;
}