Creating a custom document data extraction template in C#

This tutorial describes the steps required to create a custom document data extraction template.

Use case definition

A renowned company hosted a talent contest and awarded certificates of completion to all successful participants.

According to company policies, it is essential to maintain records of all rewards in the rewards archive system.

The large number of talented participants has generated a substantial volume of documents.

Consequently, the HR department has expressed concerns regarding the labor-intensive nature of manually processing this high volume of data.

In response to this challenge, the engineering department has been tasked with swiftly developing an intelligent data processing system designed to efficiently capture and manage this data.

The most evident solution was to employ XtractFlow to create a tailored data extraction template, which would serve to capture all the necessary information.

An example of a certificate of completion.

download the input image

Prerequisites

-> Check the Prerequisites page.

Building the document template

A DocumentTemplate object must be created.

This object will represent a document template, which serves as the comprehensive definition for a specific type of document.

It should define a unique identifier, a public name, a semantic description, and the set of fields to be extracted. In this use case, the following information needs to be extracted:

The year of certificate delivery.
The person who received the certificate.
The mentor of the student.
The member of the jury.
The achievement of the student.
The postal address of the organism.

XtractFlow Document template generation in csharp

static DocumentTemplate buildOrpalisCertificateTemplate()
{
    return new DocumentTemplate()
    {
        Name = "ORPALIS certificate",
        Identifier = "8843294B-5840-4693-8D2A-C4CF76DB1060",
        SemanticDescription = "ORPALIS certificate of completion",
        Fields = new List<TemplateField>
        {
            new()
            {
                Name = "Year",
                Format = FieldDataFormat.Number,
                SemanticDescription = "The year of certificate delivery"
            },
            new()
            {
                Name = "Student",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The person who received the certificate"
            },
            new()
            {
                Name = "Mentor",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The mentor of the student"
            },
            new()
            {
                Name = "Jury member",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The member of the jury"
            },
            new()
            {
                Name = "Achievement",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The achievement of the student"
            },
            new()
            {
                Name = "Organism address",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The postal address of the organism",
                StandardValidationMethods = new[]{ new StandardFieldValidationMethod( StandardFieldValidation.PostalAddressIntegrity) }
            }
        }
    };
}
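
The Identifier used in the template above has the shape of a GUID. As a minimal sketch, assuming any unique string is acceptable as an identifier, a value of the same shape can be produced with the standard .NET Guid type. The helper name NewTemplateIdentifier is hypothetical and is not part of XtractFlow.

```csharp
using System;

static class TemplateIdentifierSketch
{
    // Hypothetical helper (not an XtractFlow API): returns a new GUID
    // in the upper-case, hyphenated form used by the template above.
    public static string NewTemplateIdentifier() =>
        Guid.NewGuid().ToString("D").ToUpperInvariant();

    static void Main()
    {
        // Print one identifier that can be pasted into the template definition.
        Console.WriteLine(NewTemplateIdentifier());
    }
}
```

Since the tutorial hard-codes the identifier, it is presumably meant to stay the same across runs, so a value like this would be generated once and then stored in the template rather than regenerated each time.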

Building the component

Create a ProcessorComponent object, which is a necessary component for the processor.

This object will encapsulate the document processing workflow's logic.

XtractFlow ProcessorComponent generation in csharp

static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        EnableClassifier = false, // classification is not required as a single class of documents will be processed.
        EnableFieldsExtraction = true, // enabling extraction of fields specified from the previously defined template.
        Templates = new DocumentTemplate[] {
            buildOrpalisCertificateTemplate()
        }
    };
}

Processing a document and analyzing results

Next, instantiate a DocumentProcessor object and call its Process method to start the inference process.

The method returns a ProcessorResult object that contains the processing outcome.

Using XtractFlow DocumentProcessor in csharp

// building the component
ProcessorComponent component = buildComponent();
// processing the document
ProcessorResult result = new DocumentProcessor().Process("orpalis_certificate.jpg", component);
// analyzing results
if (result.ExtractedFields != null)
{
    foreach (var item in result.ExtractedFields)
    {
        Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
    }
}

Obtained results:

Field name: 'Year' - Field value: '2023' - Validation state: (Undefined)
Field name: 'Student' - Field value: 'Fabio Escobar' - Validation state: (Undefined)
Field name: 'Mentor' - Field value: 'Loïc Carrère' - Validation state: (Undefined)
Field name: 'Jury member' - Field value: 'Olivier Houssin' - Validation state: (Undefined)
Field name: 'Achievement' - Field value: 'Successfully juggled with 3 bananas' - Validation state: (Undefined)
Field name: 'Organism address' - Field value: '52 Rue de Marclan, 31600 MURET, France' - Validation state: (Valid)
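
Once the fields are extracted, they still have to reach the rewards archive. The sketch below shows one way the printed results could be indexed by field name and checked before archiving. It deliberately does not use XtractFlow types: FieldResult is a hypothetical stand-in that carries the same three values printed above, and the validation state is held as plain text. It also assumes that "Undefined" means no validation method was configured for the field, which matches the output above but is not confirmed here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one extracted field; not an XtractFlow type.
record FieldResult(string FieldName, string Value, string ValidationState);

static class ResultSketch
{
    // Index the values by field name so archive code can look them up directly.
    public static Dictionary<string, string> ToRecord(IEnumerable<FieldResult> fields) =>
        fields.ToDictionary(f => f.FieldName, f => f.Value);

    // Names of fields whose state is neither "Valid" nor "Undefined"
    // (assumption: "Undefined" means no validation was requested).
    public static List<string> NeedsReview(IEnumerable<FieldResult> fields) =>
        fields.Where(f => f.ValidationState != "Valid" && f.ValidationState != "Undefined")
              .Select(f => f.FieldName)
              .ToList();

    static void Main()
    {
        // Sample values copied from the results listed above.
        var fields = new List<FieldResult>
        {
            new("Year", "2023", "Undefined"),
            new("Student", "Fabio Escobar", "Undefined"),
            new("Organism address", "52 Rue de Marclan, 31600 MURET, France", "Valid")
        };

        var record = ToRecord(fields);
        Console.WriteLine(record["Student"]);
        Console.WriteLine(NeedsReview(fields).Count);
    }
}
```

In a real integration the same two steps would be applied to the items of result.ExtractedFields, using whatever validation state type the library exposes.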

The complete solution

Using XtractFlow to achieve custom data extraction

        static void runExtraction()
        {
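            // replace "GDPICTURE_KEY" with your own GdPicture license key.
            Configuration.RegisterGdPictureKey("GDPICTURE_KEY");
            // OPENAI_KEY is assumed to be a string holding your OpenAI API key, defined elsewhere.
            Configuration.RegisterLLMProvider(new OpenAIProvider(OPENAI_KEY));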
            Configuration.ResourcesFolder = "resources";
            // building the component
            ProcessorComponent component = buildComponent();
            // processing the document
            ProcessorResult result = new DocumentProcessor().Process("orpalis_certificate.jpg", component);
            // analyzing results
            if (result.ExtractedFields != null)
            {
                foreach (var item in result.ExtractedFields)
                {
                    Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
                }
            }
        }

        static ProcessorComponent buildComponent()
        {
            return new ProcessorComponent()
            {
                EnableClassifier = false, //classification is not required as a single class of documents will be processed.
                EnableFieldsExtraction = true, //enabling extraction of fields specified from the previously defined template.
                Templates = new DocumentTemplate[] {
                    buildOrpalisCertificateTemplate()
                }
            };
        }

        static DocumentTemplate buildOrpalisCertificateTemplate()
        {
            return new DocumentTemplate()
            {
                Name = "ORPALIS certificate",
                Identifier = "8843294B-5840-4693-8D2A-C4CF76DB1060",
                SemanticDescription = "ORPALIS certificate of completion",
                Fields = new List<TemplateField>
                {
                    new()
                    {
                        Name = "Year",
                        Format = FieldDataFormat.Number,
                        SemanticDescription = "The year of certificate delivery"
                    },
                    new()
                    {
                        Name = "Student",
                        Format = FieldDataFormat.Text,
                        SemanticDescription = "The person who received the certificate"
                    },
                    new()
                    {
                        Name = "Mentor",
                        Format = FieldDataFormat.Text,
                        SemanticDescription = "The mentor of the student"
                    },
                    new()
                    {
                        Name = "Jury member",
                        Format = FieldDataFormat.Text,
                        SemanticDescription = "The member of the jury"
                    },
                    new()
                    {
                        Name = "Achievement",
                        Format = FieldDataFormat.Text,
                        SemanticDescription = "The achievement of the student"
                    },
                    new()
                    {
                        Name = "Organism address",
                        Format = FieldDataFormat.Text,
                        SemanticDescription = "The postal address of the organism",
                        StandardValidationMethods = new[]{ new StandardFieldValidationMethod( StandardFieldValidation.PostalAddressIntegrity) }
                    }
                }
            };
        }
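
The complete solution passes the license and API keys directly in code. As a hedged alternative, the keys could be read from environment variables so that they stay out of source control. The sketch below uses only the standard .NET Environment class. The helper name RequireEnv and the variable name used in the demo are assumptions for illustration, not part of XtractFlow.

```csharp
using System;

static class KeySketch
{
    // Hypothetical helper: reads a required setting from the environment
    // and fails fast with a clear message when it is missing or blank.
    public static string RequireEnv(string name)
    {
        var value = Environment.GetEnvironmentVariable(name);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException($"Environment variable '{name}' is not set.");
        }
        return value;
    }

    static void Main()
    {
        // Demo only: set a value for this process so the lookup succeeds.
        Environment.SetEnvironmentVariable("OPENAI_KEY", "demo-value");

        string openAiKey = RequireEnv("OPENAI_KEY");
        Console.WriteLine($"Key loaded ({openAiKey.Length} characters).");
    }
}
```

The returned strings could then be passed to Configuration.RegisterGdPictureKey and to the OpenAIProvider constructor in place of the hard-coded values.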

The end!