Transforming document extraction with machine learning
The extraction of key-value pairs involves two tasks:
- Use OCR technology to recognize unstructured information and text in a document.
- Use machine learning, specifically deep learning, to make sense of the unstructured information by composing links between different parts of the extracted text.
A combination of both approaches is necessary to achieve the best results in data extraction. For this reason, PSPDFKit recognizes text and key-value pairs based on a hybrid approach of the following methods:
- Heuristics
- Mathematics
- Machine learning (ML)
This approach produces superior results compared to traditional optical character recognition (OCR) and pure ML approaches.
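To make the second task more concrete, here is a minimal sketch of linking keys to values with a purely geometric heuristic. It assumes an OCR step has already produced word boxes. The `Word` type, the `link_key_values` function, and the rule that a key is any word ending in a colon are illustrative assumptions, not PSPDFKit's API or algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR result: recognized text plus its bounding box (hypothetical type)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def link_key_values(words, line_tolerance=5.0):
    """Pair each key (assumed: a word ending in ':') with the words to its right.

    Words count as being on the same line when their top edges differ by at
    most `line_tolerance`. A value stops where the next key begins.
    """
    pairs = {}
    for key in words:
        if not key.text.endswith(":"):
            continue
        # Words to the right of the key on roughly the same line, left to right.
        to_right = sorted(
            (w for w in words if w.x > key.x and abs(w.y - key.y) <= line_tolerance),
            key=lambda w: w.x,
        )
        value = []
        for w in to_right:
            if w.text.endswith(":"):
                break  # the next key starts here
            value.append(w.text)
        if value:
            pairs[key.text.rstrip(":")] = " ".join(value)
    return pairs
```

A rule like this only handles single-word keys that sit to the left of their value on a cleanly aligned line. It says nothing about multi-word keys, values placed below their key, or skewed scans, which is the kind of gap that motivates combining heuristics with learned models.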
Traditional Approaches
This section explains the traditional approaches to key-value pair extraction.
Traditional OCR
Extracting data with the traditional OCR approach is based on heuristics. Its biggest limitation is that it needs a different template for each document type. This works well for simple documents with structured data, but it doesn’t perform well with unstructured or semi-structured documents.
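The template dependency can be illustrated with a minimal sketch. It assumes OCR output arrives as `(text, x, y, width, height)` tuples and that each template is a set of fixed page regions. The `TEMPLATES` table, its coordinates, the `invoice_vendor_a` document type, and the `extract_with_template` function are all hypothetical and do not reflect any particular product.

```python
# Hypothetical templates: one fixed set of field regions per document type.
# Each region is (left, top, right, bottom) in page coordinates.
TEMPLATES = {
    "invoice_vendor_a": {
        "invoice_number": (400, 40, 560, 60),
        "total": (400, 700, 560, 720),
    },
}


def extract_with_template(doc_type, words):
    """Collect the OCR words whose center lies inside each templated region.

    `words` is a list of (text, x, y, width, height) tuples from an OCR step.
    Raises KeyError when no template exists for `doc_type`.
    """
    template = TEMPLATES.get(doc_type)
    if template is None:
        raise KeyError(f"no template for document type {doc_type!r}")

    fields = {}
    for name, (left, top, right, bottom) in template.items():
        inside = [
            (x, text)
            for text, x, y, w, h in words
            if left <= x + w / 2 <= right and top <= y + h / 2 <= bottom
        ]
        # Join matching words in left-to-right order; empty if nothing matched.
        fields[name] = " ".join(text for _, text in sorted(inside))
    return fields
```

In this sketch, a document whose layout matches the template yields its fields, a document of an unknown type raises `KeyError`, and a known type whose fields have shifted on the page yields empty strings. Each new layout needs its own entry in the table.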
This approach also inherits the limitations of traditional OCR engines, which have difficulty recognizing text in the following contexts:
- Colored backgrounds
- Glare
- Skew
- Text in tables and graphics
- Handwritten text
Lastly, data extraction solutions relying only on traditional OCR are difficult to scale.
Machine Learning and Deep Learning
Data extraction solutions that leverage machine learning and deep learning use artificial intelligence (AI) technologies to mitigate traditional OCR limitations. These deep learning approaches are usually a combination of different techniques, such as convolutional neural networks, long short-term memory layers, transformers, and graph neural networks.
Data extraction relying only on machine learning and deep learning often fails for documents with a lot of noise and dotted lines.
PSPDFKit Data Model
By automatically recognizing the document type, PSPDFKit adapts to the context and determines the extraction approach that makes the best use of available resources. PSPDFKit recognizes the document type based on adaptive layout understanding and natural language processing (NLP) technologies. This hybrid approach includes heuristics, mathematics, and machine learning (ML), and it addresses the usual weaknesses of traditional OCR and pure ML engines.
The PSPDFKit data model enables you to extract data from documents with excellent results. PSPDFKit’s hybrid approach performs better than traditional OCR and pure ML engines, especially for documents with the following features:
- Noise
- Dotted lines
- Broken characters
- Text on colored backgrounds
- Underlined text
- Skewed text
- Text in graphics and tables