This tutorial describes the steps required to create a custom document data extraction template.
A renowned company hosted a talent contest and awarded certificates of completion to all successful participants.
According to company policies, it is essential to maintain records of all rewards in the rewards archive system.
The large number of successful participants resulted in a substantial volume of documents.
Consequently, the HR department has expressed concerns regarding the labor-intensive nature of manually processing this high volume of data.
In response, the engineering department has been tasked with quickly developing an intelligent data processing system that captures and manages this data efficiently.
The most evident solution was to employ XtractFlow to create a tailored data extraction template, which would serve to capture all the necessary information.
-> Check the Prerequisites page.
A DocumentTemplate object must be created.
This object will represent a document template, which serves as the comprehensive definition for a specific type of document.
It must define a unique identifier, a public name, a semantic description, and the set of fields to be extracted. In this use case, the following information needs to be extracted: the year, the student, the mentor, the jury member, the achievement, and the organism address.
XtractFlow Document template generation in csharp

```csharp
static DocumentTemplate buildOrpalisCertificateTemplate()
{
    return new DocumentTemplate()
    {
        Name = "ORPALIS certificate",
        Identifier = "8843294B-5840-4693-8D2A-C4CF76DB1060",
        SemanticDescription = "ORPALIS certificate of completion",
        Fields = new List<TemplateField>
        {
            new()
            {
                Name = "Year",
                Format = FieldDataFormat.Number,
                SemanticDescription = "The year of certificate delivery"
            },
            new()
            {
                Name = "Student",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The person who received the certificate"
            },
            new()
            {
                Name = "Mentor",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The mentor of the student"
            },
            new()
            {
                Name = "Jury member",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The member of the jury"
            },
            new()
            {
                Name = "Achievement",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The achievement of the student"
            },
            new()
            {
                Name = "Organism address",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The postal address of the organism",
                StandardValidationMethods = new[]
                {
                    new StandardFieldValidationMethod(
                        StandardFieldValidation.PostalAddressIntegrity)
                }
            }
        }
    };
}
```
Next, create a ProcessorComponent object, which the processor requires.
This object will encapsulate the document processing workflow's logic.
XtractFlow ProcessorComponent generation in csharp

```csharp
static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        // Classification is not required as a single class of documents will be processed.
        EnableClassifier = false,
        // Enabling extraction of fields specified from the previously defined template.
        EnableFieldsExtraction = true,
        Templates = new DocumentTemplate[]
        {
            buildOrpalisCertificateTemplate()
        }
    };
}
```
Finally, instantiate a DocumentProcessor object and call its Process method to start the inference process.
The method returns a ProcessorResult object containing the processing outcome.
Using XtractFlow DocumentProcessor in csharp

```csharp
// building the component
ProcessorComponent component = buildComponent();

// processing the document
ProcessorResult result = new DocumentProcessor().Process("orpalis_certificate.jpg", component);

// analyzing results
if (result.ExtractedFields != null)
{
    foreach (var item in result.ExtractedFields)
    {
        Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
    }
}
```
Obtained results:
Field name: 'Year' - Field value: '2023' - Validation state: (Undefined)
Field name: 'Student' - Field value: 'Fabio Escobar' - Validation state: (Undefined)
Field name: 'Mentor' - Field value: 'Loïc Carrère' - Validation state: (Undefined)
Field name: 'Jury member' - Field value: 'Olivier Houssin' - Validation state: (Undefined)
Field name: 'Achievement' - Field value: 'Successfully juggled with 3 bananas' - Validation state: (Undefined)
Field name: 'Organism address' - Field value: '52 Rue de Marclan, 31600 MURET, France' - Validation state: (Valid)
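Since the extracted values are meant for the rewards archive system, it may help to turn the field list into a lookup and to flag values that deserve a manual check. The following is a minimal sketch under stated assumptions: `ExtractedFieldInfo`, `ArchiveMapper`, `ToRecord`, and `NeedsReview` are hypothetical names that are not part of XtractFlow, and the record mirrors only the three properties printed above. It also assumes that `Undefined` means no validation method was configured for the field, which the output suggests but the tutorial does not state.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one extracted field. It mirrors only the three
// properties printed in the tutorial: FieldName, Value, and ValidationState.
public record ExtractedFieldInfo(string FieldName, string Value, string ValidationState);

public static class ArchiveMapper
{
    // Builds a field name -> value lookup for the archive system.
    // Throws ArgumentException if two fields share the same name.
    public static Dictionary<string, string> ToRecord(IEnumerable<ExtractedFieldInfo> fields) =>
        fields.ToDictionary(f => f.FieldName, f => f.Value);

    // Returns the names of fields whose state is neither "Valid" nor "Undefined".
    // Assumption: "Undefined" means that no validation was configured.
    public static List<string> NeedsReview(IEnumerable<ExtractedFieldInfo> fields) =>
        fields.Where(f => f.ValidationState != "Valid" && f.ValidationState != "Undefined")
              .Select(f => f.FieldName)
              .ToList();
}
```

To feed it from a real result, each item of `result.ExtractedFields` could be projected with something like `new ExtractedFieldInfo(item.FieldName, $"{item.Value}", item.ValidationState.ToString())`. String interpolation is used here because the tutorial does not state the exact type of `Value`.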
Using XtractFlow to achieve custom data extraction

```csharp
static void runExtraction()
{
    // "GDPICTURE_KEY" and OPENAI_KEY are placeholders for your own license and API keys.
    Configuration.RegisterGdPictureKey("GDPICTURE_KEY");
    Configuration.RegisterLLMProvider(new OpenAIProvider(OPENAI_KEY));
    Configuration.ResourcesFolder = "resources";

    // building the component
    ProcessorComponent component = buildComponent();

    // processing the document
    ProcessorResult result = new DocumentProcessor().Process("orpalis_certificate.jpg", component);

    // analyzing results
    if (result.ExtractedFields != null)
    {
        foreach (var item in result.ExtractedFields)
        {
            Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
        }
    }
}

static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        // Classification is not required as a single class of documents will be processed.
        EnableClassifier = false,
        // Enabling extraction of fields specified from the previously defined template.
        EnableFieldsExtraction = true,
        Templates = new DocumentTemplate[]
        {
            buildOrpalisCertificateTemplate()
        }
    };
}

static DocumentTemplate buildOrpalisCertificateTemplate()
{
    return new DocumentTemplate()
    {
        Name = "ORPALIS certificate",
        Identifier = "8843294B-5840-4693-8D2A-C4CF76DB1060",
        SemanticDescription = "ORPALIS certificate of completion",
        Fields = new List<TemplateField>
        {
            new()
            {
                Name = "Year",
                Format = FieldDataFormat.Number,
                SemanticDescription = "The year of certificate delivery"
            },
            new()
            {
                Name = "Student",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The person who received the certificate"
            },
            new()
            {
                Name = "Mentor",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The mentor of the student"
            },
            new()
            {
                Name = "Jury member",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The member of the jury"
            },
            new()
            {
                Name = "Achievement",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The achievement of the student"
            },
            new()
            {
                Name = "Organism address",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The postal address of the organism",
                StandardValidationMethods = new[]
                {
                    new StandardFieldValidationMethod(
                        StandardFieldValidation.PostalAddressIntegrity)
                }
            }
        }
    };
}
```
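Because the scenario involves a large volume of certificates, the single-file call can be wrapped in a small batch loop. The sketch below is one possible approach, not part of XtractFlow: `BatchRunner` and `ProcessFolder` are hypothetical names, and the per-file work is passed in as a delegate so the helper does not depend on the SDK. It assumes that a failure on one file surfaces as an exception, which the tutorial does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BatchRunner
{
    // Applies processOne to every file in 'folder' whose name matches 'pattern'.
    // An exception on one file is recorded and does not stop the batch.
    public static (Dictionary<string, T> Results, Dictionary<string, Exception> Failures)
        ProcessFolder<T>(string folder, string pattern, Func<string, T> processOne)
    {
        var results = new Dictionary<string, T>();
        var failures = new Dictionary<string, Exception>();

        foreach (string path in Directory.EnumerateFiles(folder, pattern))
        {
            try
            {
                results[path] = processOne(path);
            }
            catch (Exception ex)
            {
                failures[path] = ex;
            }
        }

        return (results, failures);
    }
}
```

With the component built as in the tutorial, a call could look like `BatchRunner.ProcessFolder("certificates", "*.jpg", path => new DocumentProcessor().Process(path, component))`, where `"certificates"` is a hypothetical folder name. A new `DocumentProcessor` is created per file to match the tutorial's usage, since it does not say whether an instance can be reused.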