Document classification and recognition

This guide describes the steps required to create a solution for categorizing and extracting data with XtractFlow’s AI-powered technology.

Example use case

ACME’s CRM system handles a variety of document types, such as invoices, resumes, purchase orders, and payroll statements. The company needs an automated solution that categorizes these documents and extracts the relevant data for each category.

Setting up document templates

Create a collection of DocumentTemplate objects. Each template defines a specific document type and is used by both the classification and extraction processes:

static List<DocumentTemplate> setupDocumentTemplates()
{
    List<DocumentTemplate> templates = new List<DocumentTemplate>();
    templates.Add(DocumentTemplates.Invoice); // Adding invoice preset.
    templates.Add(DocumentTemplates.Resume); // Adding resume preset.
    templates.Add(DocumentTemplates.PurchaseOrder); // Adding purchase order preset.
    templates.Add(DocumentTemplates.PayrollStatement); // Adding payroll statement preset.
    return templates;
}

Building the component

Create a ProcessorComponent object, which the processor requires. This object encapsulates the logic of the document processing workflow:

static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        EnableClassifier = true, // Enabling classification.
        EnableFieldsExtraction = true, // Enabling extraction.
        Templates = setupDocumentTemplates()
    };
}

Processing the documents

Next, instantiate a DocumentProcessor object and call its Process method to start the inference process. The method returns a ProcessorResult object that contains the processing outcome:

// Building the component.
ProcessorComponent component = buildComponent();
// Processing all documents. Replace [DIRECTORY_PATH] with the path of the folder that contains them.
foreach (string documentFile in Directory.GetFiles([DIRECTORY_PATH]))
{
    ProcessorResult result = new DocumentProcessor().Process(documentFile, component);
    // Analyzing results.
    if (result.Template != null)
    {
        Console.WriteLine("Document category: " + result.Template.Name);
        if (result.ExtractedFields != null)
        {
            foreach (var item in result.ExtractedFields)
            {
                Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
            }
        }
    }
}
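
When a folder holds many files, a per-category summary can be easier to read than per-file console output. The following is a minimal sketch, not part of XtractFlow: CategoryTally and its Count method are hypothetical helpers. They assume you collect the category name of each processed document (result.Template.Name, or null when result.Template is null, mirroring the null check above) into a list and count the documents per category afterward.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of XtractFlow.
public static class CategoryTally
{
    // Label used for documents where no template matched (result.Template == null).
    public const string Unclassified = "Unclassified";

    // Counts how many documents fall into each category name.
    // A null entry stands for a document with no matching template.
    public static SortedDictionary<string, int> Count(IEnumerable<string> categoryNames)
    {
        var counts = new SortedDictionary<string, int>(StringComparer.Ordinal);
        foreach (string name in categoryNames)
        {
            string key = name ?? Unclassified;
            // TryGetValue leaves current at 0 when the key is not present yet.
            counts.TryGetValue(key, out int current);
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

One way to use it is to add result.Template?.Name to a List<string> inside the loop, then call CategoryTally.Count on that list once the loop finishes and print each key and count.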

The complete solution

Here’s the full example showing how to classify documents and extract data:

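// Standard .NET namespaces used below. Also import the XtractFlow namespaces from the installed package.
using System;
using System.Collections.Generic;
using System.IO;

static void runExtraction()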
{
    // GDPICTURE_KEY and OPENAI_KEY are placeholders. Replace them with your own keys.
    Configuration.RegisterGdPictureKey("GDPICTURE_KEY");
    Configuration.RegisterLLMProvider(new OpenAIProvider(OPENAI_KEY));
    Configuration.ResourcesFolder = "resources";
    // Building the component.
    ProcessorComponent component = buildComponent();
    // Processing all documents. Replace [DIRECTORY_PATH] with the path of the folder that contains them.
    foreach (string documentFile in Directory.GetFiles([DIRECTORY_PATH]))
    {
        ProcessorResult result = new DocumentProcessor().Process(documentFile, component);
        // Analyzing results.
        if (result.Template != null)
        {
            Console.WriteLine("Document category: " + result.Template.Name);
            if (result.ExtractedFields != null)
            {
                foreach (var item in result.ExtractedFields)
                {
                    Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
                }
            }
        }
    }
}

static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        EnableClassifier = true, // Enabling classification.
        EnableFieldsExtraction = true, // Enabling extraction.
        Templates = setupDocumentTemplates()
    };
}

static List<DocumentTemplate> setupDocumentTemplates()
{
    List<DocumentTemplate> templates = new List<DocumentTemplate>();
    templates.Add(DocumentTemplates.Invoice); // Adding invoice preset.
    templates.Add(DocumentTemplates.Resume); // Adding resume preset.
    templates.Add(DocumentTemplates.PurchaseOrder); // Adding purchase order preset.
    templates.Add(DocumentTemplates.PayrollStatement); // Adding payroll statement preset.
    return templates;
}
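
The license key and the OpenAI key in the example above are placeholders. Rather than hard-coding real keys, one option is to read them from environment variables at startup. The following is a minimal sketch under that assumption: Secrets and its Require method are hypothetical helpers, not part of XtractFlow, and the variable names are whatever you choose to define on your machine.

```csharp
using System;

// Hypothetical helper for illustration; not part of XtractFlow.
public static class Secrets
{
    // Returns the value of an environment variable, or throws a descriptive
    // error when it is missing or blank so that misconfiguration fails early.
    public static string Require(string variableName)
    {
        string value = Environment.GetEnvironmentVariable(variableName);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                $"Environment variable '{variableName}' is not set.");
        }
        return value;
    }
}
```

Assuming the keys are stored in variables named GDPICTURE_KEY and OPENAI_KEY, and that OpenAIProvider accepts the key as a string, the registration calls could take Secrets.Require("GDPICTURE_KEY") and Secrets.Require("OPENAI_KEY") in place of the literal placeholders.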