Extract Text, Tables, and More from PDFs
This guide explains how to extract data from PDFs using Document Engine.
You can extract the following pieces of information from a PDF document:
- Text
- Tables
- Key-value pairs. For more information, refer to the guide on how key-value pair extraction works.
Sending the Request to Extract Data
To extract data from all pages of a document, send a multipart POST request to the /api/build endpoint. In the instructions, specify the following output parameters:
- type specifies the output type. Set this to json-content.
- plainText is a Boolean value that determines whether to extract data as plain text.
- structuredText is a Boolean value that determines whether to extract data as structured text. Enabling this option gives you information about characters, lines, paragraphs, and words.
- keyValuePairs is a Boolean value that determines whether to extract key-value pairs.
- tables is a Boolean value that determines whether to extract table data.
- language specifies the language used for recognizing text with optical character recognition (OCR). Text is sometimes stored in a PDF or an image in a way that prevents you from searching or copying it. PSPDFKit’s OCR engine recognizes this text and saves it in a separate file in which you can search, copy, and paste it.
curl -X POST http://localhost:5000/api/build \
  -H "Authorization: Token token=<API token>" \
  -F document=@/path/to/example-document.pdf \
  -F instructions='{
    "parts": [
      { "file": "document" }
    ],
    "output": {
      "type": "json-content",
      "plainText": true,
      "structuredText": true,
      "keyValuePairs": true,
      "tables": true,
      "language": "english"
    }
  }' \
  -o result.json
POST /api/build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary
Authorization: Token token=<API token>

--customboundary
Content-Disposition: form-data; name="document"; filename="example-document.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    { "file": "document" }
  ],
  "output": {
    "type": "json-content",
    "plainText": true,
    "structuredText": true,
    "keyValuePairs": true,
    "tables": true,
    "language": "english"
  }
}
--customboundary--
For more information on the Build instructions, refer to the API Reference.
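If you assemble the request from a script, the instructions can be generated rather than written by hand. The following is a minimal Python sketch, not part of Document Engine: build_instructions is a hypothetical helper, and its default values are choices made for this illustration, not defaults of the API. It only produces the JSON string for the instructions form field, using the output parameters described above.

```python
import json


def build_instructions(plain_text=True, structured_text=False,
                       key_value_pairs=False, tables=False,
                       language="english"):
    """Return the JSON string for the `instructions` form field.

    Hypothetical helper: the keyword defaults are illustrative only.
    """
    instructions = {
        # "document" must match the name of the multipart part that
        # carries the PDF (see the request examples above).
        "parts": [{"file": "document"}],
        "output": {
            "type": "json-content",
            "plainText": plain_text,
            "structuredText": structured_text,
            "keyValuePairs": key_value_pairs,
            "tables": tables,
            "language": language,
        },
    }
    return json.dumps(instructions)


if __name__ == "__main__":
    # Request plain text and tables only.
    print(build_instructions(tables=True))
```

The returned string can be sent as the instructions part of the multipart request, alongside the document part, with any HTTP client.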
Interpreting the Data Extraction Response
The API response contains the data you requested through the output parameters of the API request, such as:
- Plain text
- Structured text with information about characters, lines, paragraphs, and words
- Extracted key-value pairs
- Tables
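As an illustration of reading these pieces, the following Python sketch pulls the plain text and the key-value pairs out of a single page object. It assumes the field names used in the example response in this guide (plainText, keyValuePairs, confidence, key, value, content). The summarize_page helper, the min_confidence threshold, and the sample data are hypothetical, and the sketch assumes that confidence is reported on a 0 to 100 scale, which the example value suggests but this guide does not state.

```python
def summarize_page(page, min_confidence=0.0):
    """Return (plain_text, [(key, value), ...]) for one page object.

    Pairs whose confidence is below `min_confidence` are skipped.
    Missing fields are treated as empty, since each one is only present
    when the matching output parameter was enabled in the request.
    """
    text = page.get("plainText", "")
    pairs = [
        (pair["key"]["content"], pair["value"]["content"])
        for pair in page.get("keyValuePairs", [])
        if pair.get("confidence", 0.0) >= min_confidence
    ]
    return text, pairs


# Hypothetical sample data, modeled on the example response in this guide.
sample_page = {
    "pageIndex": 0,
    "plainText": "Invoice number 1001\n",
    "keyValuePairs": [
        {
            "confidence": 95.4,
            "key": {"content": "Invoice number"},
            "value": {"content": "1001", "dataType": "Number"},
        },
        {
            "confidence": 40.0,
            "key": {"content": "Total"},
            "value": {"content": "?"},
        },
    ],
}

if __name__ == "__main__":
    text, pairs = summarize_page(sample_page, min_confidence=90)
    print(text.strip())
    print(pairs)
```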
Example Data Extraction Response
{
  "pages": [
    {
      "pageIndex": 0,
      "plainText": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa.\n",
      "structuredText": {
        "characters": [
          {
            "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
            "value": "T"
          }
        ],
        "lines": [
          {
            "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
            "firstWordIndex": 0,
            "isRTL": false,
            "isVertical": false,
            "wordCount": 5
          }
        ],
        "paragraphs": [
          {
            "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
            "firstLineIndex": 0,
            "lineCount": 3
          }
        ],
        "words": [
          {
            "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
            "characterCount": 4,
            "firstCharacterIndex": 0,
            "isFromDictionary": true,
            "value": "word"
          }
        ]
      },
      "keyValuePairs": [
        {
          "confidence": 95.4,
          "key": {
            "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
            "content": "#"
          },
          "value": {
            "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
            "content": "€",
            "dataType": "Currency"
          }
        }
      ],
      "tables": [
        {
          "confidence": 95.4,
          "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
          "cells": [
            {
              "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
              "rowIndex": 0,
              "columnIndex": 0,
              "isHeader": true,
              "text": "Invoice number"
            }
          ],
          "columns": [
            {
              "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 }
            }
          ],
          "lines": [
            {
              "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
              "isVertical": false,
              "thickness": 0
            }
          ],
          "rows": [
            {
              "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 }
            }
          ]
        }
      ]
    }
  ]
}
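Each table in the response lists its cells individually, with a rowIndex and a columnIndex. One way to work with such a table is to arrange the cells into a grid of rows. The Python sketch below does this under two assumptions that the example suggests but this guide does not confirm: that both indices are zero-based, and that each cell occupies exactly one grid position. The table_to_grid helper and the sample table are hypothetical.

```python
def table_to_grid(table):
    """Arrange a table's cells into a list of rows of cell text.

    Positions that no cell covers are left as empty strings.
    """
    cells = table.get("cells", [])
    if not cells:
        return []
    # Size the grid from the largest indices seen (assumed zero-based).
    row_count = max(cell["rowIndex"] for cell in cells) + 1
    column_count = max(cell["columnIndex"] for cell in cells) + 1
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell.get("text", "")
    return grid


# Hypothetical sample table, modeled on the example response above.
sample_table = {
    "confidence": 95.4,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "isHeader": True, "text": "Invoice number"},
        {"rowIndex": 0, "columnIndex": 1, "isHeader": True, "text": "Date"},
        {"rowIndex": 1, "columnIndex": 0, "isHeader": False, "text": "1001"},
    ],
}

if __name__ == "__main__":
    for row in table_to_grid(sample_table):
        print(" | ".join(row))
```

To process a full response, the same function can be applied to every entry of the tables array on every page of the pages array.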