Streamline key-value pair extraction from documents
PSPDFKit Processor has been deprecated and replaced by Document Engine. To start using Document Engine, refer to the migration guide. With Document Engine, you’ll have access to robust new capabilities (read the blog for more information).
PSPDFKit’s key-value pair (KVP) extraction engine enables you to recognize related data items in a document and export them to an external destination like a spreadsheet.
Key-Value Pair Extraction
Key-value pairs are two related data items: a key and a value. Which key-value pairs appear depends on the type of document. For example, key-value pairs on invoices can be the following:
Key | Value |
---|---|
Invoice Number | No 00162 |
Billing Date | 20/09/2022 |
Total | 1,165.10€ |
Key-value pair fields on government forms can be the following:
Key | Value |
---|---|
Company Name | Nutrient GmbH |
Registration Number | FN 548939p |
Date of Incorporation | 04/10/2013 |
It’s easy to get key-value pairs from structured documents like Excel files because the values are all named. However, 90 percent of all documents have unstructured data. For these documents, you need a KVP extraction tool to retrieve the information. Intelligent document processing (IDP) extracts data from unstructured and semi-structured documents using OCR and artificial intelligence (AI) technologies.
By automatically recognizing the document type, PSPDFKit adapts to the context and determines the extraction approach that makes the best use of available resources. PSPDFKit recognizes the document type based on adaptive layout understanding and natural language processing (NLP) technologies. This hybrid approach includes heuristics, mathematics, and machine learning (ML) capabilities that address the usual weaknesses of traditional OCR and pure ML engines.
How to Use Key-Value Pair Extraction
This example retrieves key-value pairs, such as the invoice number and billing date, from the scanned invoice below.
Format the data extraction result to obtain the following table:
Key | Value | Data Type | Confidence Level |
---|---|---|---|
Billing date | 20/09/2022 | DateTime | 100% |
Order date | 20/09/2022 | DateTime | 100% |
Republic of PDF | +100 847 738 227 | PhoneNumber | 77.2% |
IBAN | AT13 2060 4236 6111 5994 | IBAN | 100% |
Customer | Vandelay Industries Around the Corner 13 NBC City | String | 69.8% |
Delivery address | Vandelay Industries Around the Corner 13 NBC City | String | 69.9% |
Invoice number | No 00162 | String | 70.9% |
Ref. number | 34751 | Number | 92.9% |
No | 00162 | Number | 100% |
Reference | P00201 | UID | 100% |
Quantity Total (excl. VAT) | 320.00€ | Currency | 59% |
Subtotal | 1,220.00€ | Currency | 100% |
Discount (10%) | -122.00€ | Currency | 90.6% |
VAT (5.5%) | +67.10€ | Currency | 66.9% |
Shipping cost | 0.00€ | Currency | 75% |
TOTAL | 1,165.10€ | Currency | 100% |
Description | Lake Mirror | String | 99.6% |
VAT | 5.5% | Percentage | 66.6% |
Price per unit (excl. VAT) | 320.00€ | Currency | 80% |
Tax No. | AT98765321 | UID | 73.8% |
# | [email protected] | EmailAddress | 65.6% |
# | www.bruuuk.com | URL | 65.6% |
This table also contains information about the data type and the confidence level for each key-value pair:
- The data type describes the nature of the content. In this example, the engine recognizes the value [email protected] as an email and the value +100 847 738 227 as a phone number.
- The confidence level describes how confident the KVP engine is in the accuracy of the data extraction.
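As a rough illustration of the formatting step, the sketch below renders extracted pairs as a table with data type and confidence level columns and flags low-confidence entries for manual review. The `ExtractedPair` record, the `to_table` and `needs_review` helpers, and the 0.75 threshold are all hypothetical names and values chosen for this example; they are not part of the PSPDFKit API, and the engine's actual output format may differ.

```python
from dataclasses import dataclass


@dataclass
class ExtractedPair:
    """Hypothetical record shape; not the engine's actual output format."""
    key: str
    value: str
    data_type: str
    confidence: float  # fraction between 0.0 and 1.0


def to_table(pairs: list[ExtractedPair]) -> str:
    """Render pairs as a pipe-delimited table with one row per pair."""
    lines = [
        "Key | Value | Data Type | Confidence Level |",
        "---|---|---|---|",
    ]
    for p in pairs:
        # Round to one decimal and drop a trailing ".0" (100.0 -> 100).
        percent = f"{round(p.confidence * 100, 1):g}%"
        lines.append(f"{p.key} | {p.value} | {p.data_type} | {percent} |")
    return "\n".join(lines)


def needs_review(pairs: list[ExtractedPair], threshold: float = 0.75) -> list[ExtractedPair]:
    """Return the pairs whose confidence falls below the (assumed) threshold."""
    return [p for p in pairs if p.confidence < threshold]


# Sample values copied from the result table in this guide.
pairs = [
    ExtractedPair("Billing date", "20/09/2022", "DateTime", 1.0),
    ExtractedPair("Invoice number", "No 00162", "String", 0.709),
    ExtractedPair("Ref. number", "34751", "Number", 0.929),
]

print(to_table(pairs))
print("Review manually:", [p.key for p in needs_review(pairs)])
```

A threshold like this is a design choice rather than a recommendation: a stricter value sends more pairs to manual review, while a looser one trusts the engine more.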
In this example, the KVP engine automatically detected all key-value pairs in the document with minimal code and without any preconfiguration. The engine supports many formats and languages, and it has no dependencies on external models, resources, or databases.
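Once key-value pairs are available, exporting them to a spreadsheet-friendly format can be done with general-purpose tooling. The following is a minimal sketch using only Python's standard `csv` module. It assumes the pairs have already been extracted and are held as plain dictionaries; the `pairs_to_csv` helper and the `key`/`value` field names are hypothetical and not part of the PSPDFKit API.

```python
import csv
import io


def pairs_to_csv(pairs: list[dict[str, str]]) -> str:
    """Serialize key-value pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["key", "value"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(pairs)
    return buffer.getvalue()


# Sample values taken from the invoice example at the top of this guide.
invoice_pairs = [
    {"key": "Invoice Number", "value": "No 00162"},
    {"key": "Billing Date", "value": "20/09/2022"},
    {"key": "Total", "value": "1,165.10€"},
]

csv_text = pairs_to_csv(invoice_pairs)
print(csv_text)
```

The `csv` module quotes values that contain the delimiter, so an amount such as 1,165.10€ stays in a single cell when a spreadsheet application opens the file.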