Extract key-value pairs from PDFs to JSON
The PDF recognition to JSON step automatically extracts data from searchable PDF files as key-value pairs. The output is a JSON file that contains each expected key along with its paired value.
A UI program is available for testing PDF files and previewing which key-value pairs will be extracted from them. It is located at:
"<Autobahn DX Installation directory>\distribution\recognition\AquaforestDataExtractorUI.exe"
You must use an Expected Key file to tell Nutrient Document Automation Server (DAS), previously known as Autobahn DX, which keys to extract from the input files. You can also specify synonyms for each key, so that a value paired with any of the synonyms is extracted under that key. This is useful when processing files that vary in format and label the same data differently. The example Expected Key file below shows how one file can cover multiple names for the same key.
Read about Job Designer for more information about each step property.
Example of an Expected Key file:
{
  "expectedKeys": [
    {
      "expectedKey": "Invoice No",
      "synonyms": ["Invoice Number", "Invoice No.", "Invoice Num"]
    },
    {
      "expectedKey": "Inv Date",
      "synonyms": ["Invoice Date", "Inv. Date", "Inv date"]
    },
    {
      "expectedKey": "Reference",
      "synonyms": []
    },
    {
      "expectedKey": "City/State/Zip",
      "synonyms": ["Postcode"]
    }
  ]
}
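To illustrate how synonyms relate to their expected key, the following Python sketch loads the example file above and maps each label to the key it would be reported under. It is not part of DAS: the helper names `build_synonym_lookup` and `resolve_key` are hypothetical, and the case-insensitive, whitespace-trimmed comparison is an assumption made for the example, since the matching rules DAS applies are not described here. A script like this may still be useful for checking an Expected Key file for labels listed under more than one key before using it in a job.

```python
import json

# Same content as the example Expected Key file above.
EXPECTED_KEYS_JSON = """
{
  "expectedKeys": [
    {"expectedKey": "Invoice No",
     "synonyms": ["Invoice Number", "Invoice No.", "Invoice Num"]},
    {"expectedKey": "Inv Date",
     "synonyms": ["Invoice Date", "Inv. Date", "Inv date"]},
    {"expectedKey": "Reference", "synonyms": []},
    {"expectedKey": "City/State/Zip", "synonyms": ["Postcode"]}
  ]
}
"""


def build_synonym_lookup(config):
    """Map every expected key and each of its synonyms to the expected key.

    Hypothetical helper: labels are compared case-insensitively after
    trimming whitespace, which is an assumption of this sketch.
    """
    lookup = {}
    for entry in config["expectedKeys"]:
        key = entry["expectedKey"]
        for label in [key, *entry.get("synonyms", [])]:
            normalized = label.strip().casefold()
            # A label claimed by two different keys would be ambiguous.
            if lookup.get(normalized, key) != key:
                raise ValueError(f"{label!r} is listed under more than one key")
            lookup[normalized] = key
    return lookup


def resolve_key(label, lookup):
    """Return the expected key for a label found in a PDF, or None."""
    return lookup.get(label.strip().casefold())


config = json.loads(EXPECTED_KEYS_JSON)
lookup = build_synonym_lookup(config)

print(resolve_key("Invoice Number", lookup))  # Invoice No
print(resolve_key("Postcode", lookup))        # City/State/Zip
print(resolve_key("Total", lookup))           # None
```

In this sketch, "Reference" has an empty synonym list, so only the exact key name resolves to it, while a label such as "Total" that appears in neither list resolves to nothing.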