for GdPicture.NET

    Creating a custom document data extraction template in C#

    This tutorial describes the steps required to create a custom document data extraction template.


    Use case definition

    A renowned company hosted a talent contest and awarded certificates of completion to all successful participants.

    According to company policies, it is essential to maintain records of all rewards in the rewards archive system.

    The significant number of talented participants generated a substantial volume of documents.

    Consequently, the HR department has expressed concerns regarding the labor-intensive nature of manually processing this high volume of data.

    In response to this challenge, the engineering department has been tasked with swiftly developing an intelligent data processing system designed to efficiently capture and manage this data.

    The most evident solution was to use XtractFlow to create a tailored data extraction template that captures all the necessary information.

     

    An example of a certificate of completion.

    download the input image


    Prerequisites

     -> Check the Prerequisites page.


    Building the document template

    A DocumentTemplate object must be created.

    This object will represent a document template, which serves as the comprehensive definition for a specific type of document.

    It should define a unique identifier and a public name, provide a semantic description, and outline the set of fields to extract. In this use case, the following information needs to be extracted: the year, the student, the mentor, the jury member, the achievement, and the postal address of the organism.

     

    XtractFlow Document template generation in csharp

    static DocumentTemplate buildOrpalisCertificateTemplate()
    {
        return new DocumentTemplate()
        {
            Name = "ORPALIS certificate",
            Identifier = "8843294B-5840-4693-8D2A-C4CF76DB1060",
            SemanticDescription = "ORPALIS certificate of completion",
            Fields = new List<TemplateField>
            {
                new()
                {
                    Name = "Year",
                    Format = FieldDataFormat.Number,
                    SemanticDescription = "The year of certificate delivery"
                },
                new()
                {
                    Name = "Student",
                    Format = FieldDataFormat.Text,
                    SemanticDescription = "The person who received the certificate"
                },
                new()
                {
                    Name = "Mentor",
                    Format = FieldDataFormat.Text,
                    SemanticDescription = "The mentor of the student"
                },
                new()
                {
                    Name = "Jury member",
                    Format = FieldDataFormat.Text,
                    SemanticDescription = "The member of the jury"
                },
                new()
                {
                    Name = "Achievement",
                    Format = FieldDataFormat.Text,
                    SemanticDescription = "The achievement of the student"
                },
                new()
                {
                    Name = "Organism address",
                    Format = FieldDataFormat.Text,
                    SemanticDescription = "The postal address of the organism",
                    StandardValidationMethods = new[]{ new StandardFieldValidationMethod( StandardFieldValidation.PostalAddressIntegrity) }
                }
            }
        };
    }
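
    The Identifier used in the template above has the shape of a GUID. The following is a minimal sketch of how such a value could be produced with the standard .NET Guid type. It assumes that any unique string is acceptable as an identifier, which this tutorial does not state explicitly. TemplateIdentifiers and NewIdentifier are hypothetical names, not part of XtractFlow.

```csharp
using System;

// Hypothetical helper, not part of XtractFlow.
static class TemplateIdentifiers
{
    // Returns a new GUID as an upper-case string in the hyphenated 36-character form,
    // the same shape as the identifier used in the template above.
    public static string NewIdentifier() => Guid.NewGuid().ToString().ToUpperInvariant();
}
```

    A reasonable approach is probably to generate the value once and hard-code it in the template, as the tutorial does, so that the identifier stays the same from one run to the next.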

    Building the component

    Next, create a ProcessorComponent object, which the processor requires as input.

    This object will encapsulate the document processing workflow's logic.

     

    XtractFlow ProcessorComponent generation in csharp
    static ProcessorComponent buildComponent()
    {
        return new ProcessorComponent()
        {
            EnableClassifier = false, // Classification is not required: a single class of documents is processed.
            EnableFieldsExtraction = true, // Enable extraction of the fields specified in the previously defined template.
            Templates = new DocumentTemplate[]
            {
                buildOrpalisCertificateTemplate()
            }
        };
    }

    Processing a document and analyzing results

    Next, instantiate a DocumentProcessor object and call its Process method to run the inference.

    The method returns a ProcessorResult object that contains the processing outcome.

     

    Using XtractFlow DocumentProcessor in csharp
    // building the component
    ProcessorComponent component = buildComponent();
    // processing the document
    ProcessorResult result = new DocumentProcessor().Process("orpalis_certificate.jpg", component);
    // analyzing results
    if (result.ExtractedFields != null)
    {
        foreach (var item in result.ExtractedFields)
        {
            Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
        }
    }

    Obtained results:

    Field name: 'Year' - Field value: '2023' - Validation state: (Undefined)
    Field name: 'Student' - Field value: 'Fabio Escobar' - Validation state: (Undefined)
    Field name: 'Mentor' - Field value: 'Loïc Carrère' - Validation state: (Undefined)
    Field name: 'Jury member' - Field value: 'Olivier Houssin' - Validation state: (Undefined)
    Field name: 'Achievement' - Field value: 'Successfully juggled with 3 bananas' - Validation state: (Undefined)
    Field name: 'Organism address' - Field value: '52 Rue de Marclan, 31600 MURET, France' - Validation state: (Valid)
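
    In the results above, only the Organism address field reports a validation state other than Undefined. It is also the only field that declares a validation method in the template.

    To feed the rewards archive, the extracted fields could then be written out in a simple tabular form. The following is a minimal sketch under stated assumptions: CSV is an assumed archive format, and FieldRow and CertificateArchive are hypothetical stand-ins, not XtractFlow types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one extracted field; it is not an XtractFlow type.
record FieldRow(string FieldName, string Value, string ValidationState);

// Hypothetical helper, not part of XtractFlow.
static class CertificateArchive
{
    // Wraps a value in double quotes and doubles any embedded quotes,
    // so that commas inside a value (for example in an address) stay in one column.
    static string Quote(string value) => "\"" + (value ?? "").Replace("\"", "\"\"") + "\"";

    // Builds one CSV line per field: name, value, validation state.
    public static List<string> ToCsvLines(IEnumerable<FieldRow> rows) =>
        rows.Select(r => string.Join(",", Quote(r.FieldName), Quote(r.Value), Quote(r.ValidationState)))
            .ToList();
}
```

    Each item of result.ExtractedFields could be mapped to a row with something like new FieldRow(item.FieldName, $"{item.Value}", $"{item.ValidationState}"). String interpolation is used here because this tutorial does not state the types of Value and ValidationState.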

    The complete solution

    Using XtractFlow to achieve custom data extraction
    static void runExtraction()
    {
        // "GDPICTURE_KEY" is a placeholder for your license key.
        // OPENAI_KEY is assumed to be a string constant, declared elsewhere, that holds your OpenAI API key.
        Configuration.RegisterGdPictureKey("GDPICTURE_KEY");
        Configuration.RegisterLLMProvider(new OpenAIProvider(OPENAI_KEY));
        Configuration.ResourcesFolder = "resources";
        // building the component
        ProcessorComponent component = buildComponent();
        // processing the document
        ProcessorResult result = new DocumentProcessor().Process("orpalis_certificate.jpg", component);
        // analyzing results
        if (result.ExtractedFields != null)
        {
            foreach (var item in result.ExtractedFields)
            {
                Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
            }
        }
    }

    static ProcessorComponent buildComponent()
    {
        return new ProcessorComponent()
        {
            EnableClassifier = false, // Classification is not required: a single class of documents is processed.
            EnableFieldsExtraction = true, // Enable extraction of the fields specified in the template below.
            Templates = new DocumentTemplate[]
            {
                buildOrpalisCertificateTemplate()
            }
        };
    }

    static DocumentTemplate buildOrpalisCertificateTemplate()
    {
        return new DocumentTemplate()
        {
            Name = "ORPALIS certificate",
            Identifier = "8843294B-5840-4693-8D2A-C4CF76DB1060",
            SemanticDescription = "ORPALIS certificate of completion",
            Fields = new List<TemplateField>
            {
                new()
                {
                    Name = "Year",
                    Format = FieldDataFormat.Number,
                    SemanticDescription = "The year of certificate delivery"
                },
                new()
                {
                    Name = "Student",
                    Format = FieldDataFormat.Text,
                    SemanticDescription = "The person who received the certificate"
                },
                new()
                {
                    Name = "Mentor",
                    Format = FieldDataFormat.Text,
                    SemanticDescription = "The mentor of the student"
                },
                new()
                {
                    Name = "Jury member",
                    Format = FieldDataFormat.Text,
                    SemanticDescription = "The member of the jury"
                },
                new()
                {
                    Name = "Achievement",
                    Format = FieldDataFormat.Text,
                    SemanticDescription = "The achievement of the student"
                },
                new()
                {
                    Name = "Organism address",
                    Format = FieldDataFormat.Text,
                    SemanticDescription = "The postal address of the organism",
                    StandardValidationMethods = new[]{ new StandardFieldValidationMethod( StandardFieldValidation.PostalAddressIntegrity) }
                }
            }
        };
    }
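
    The use case involves a large volume of certificates, while the solution above processes a single file. The following is a minimal sketch of how the same call could be repeated over a folder. BatchRunner and ProcessFolder are hypothetical names, the .jpg filter is an assumption about the input files, and the processFile delegate stands in for the per-document call shown above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of XtractFlow.
static class BatchRunner
{
    // Calls processFile for every .jpg file directly inside folder
    // and returns the results keyed by file path.
    public static Dictionary<string, T> ProcessFolder<T>(string folder, Func<string, T> processFile)
    {
        var results = new Dictionary<string, T>();
        foreach (string path in Directory.EnumerateFiles(folder, "*.jpg"))
        {
            results[path] = processFile(path);
        }
        return results;
    }
}
```

    With the component built as above, a call could look like BatchRunner.ProcessFolder("certificates", path => new DocumentProcessor().Process(path, component)), where "certificates" is a hypothetical folder name. The sketch creates a new DocumentProcessor per file, as the tutorial does, because this tutorial does not say whether an instance can be reused.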

    The end!