Extract Text, Tables, and More from PDFs

This guide explains how to extract data from PDFs using Document Engine.

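You can extract the following pieces of information from a PDF document:

  • Plain text
  • Structured text (characters, lines, paragraphs, and words)
  • Key-value pairs
  • Tables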

Sending the Request to Extract Data

To extract data from all pages of a document, send a multipart POST request to the /api/build endpoint. In the instructions, specify the following output parameters:

  • type specifies the output type. Set this to json-content.
  • plainText is a Boolean value that determines whether to extract data as plain text.
  • structuredText is a Boolean value that determines whether to extract data as structured text. Enabling this option gives you information about characters, lines, paragraphs, and words.
  • keyValuePairs is a Boolean value that determines whether to extract key-value pairs.
  • tables is a Boolean value that determines whether to extract table data.
  • language specifies the language used for recognizing text with optical character recognition (OCR). Some PDFs and images store text in a way that prevents you from searching or copying it. PSPDFKit’s OCR engine recognizes this text and saves it in a separate file where you can search, copy, and paste it.
curl -X POST http://localhost:5000/api/build \
  -H "Authorization: Token token=<API token>" \
  -F document=@/path/to/example-document.pdf \
  -F instructions='{
    "parts": [
      {
        "file": "document"
      }
    ],
    "output": {
      "type": "json-content",
      "plainText": true,
      "structuredText": true,
      "keyValuePairs": true,
      "tables": true,
      "language": "english"
    }
  }' \
  -o result.json

For more information on the Build instructions, refer to the API Reference.
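
The same request can be sketched in Python. This is a minimal sketch rather than an official client: it assumes the third-party requests library is installed, and build_instructions and extract_data are hypothetical helper names introduced only for illustration. The endpoint, header format, and field names mirror the curl example above.

```python
import json


def build_instructions(language="english", plain_text=True, structured_text=True,
                       key_value_pairs=True, tables=True):
    """Return the Build instructions from the curl example as a dict."""
    return {
        "parts": [{"file": "document"}],
        "output": {
            "type": "json-content",
            "plainText": plain_text,
            "structuredText": structured_text,
            "keyValuePairs": key_value_pairs,
            "tables": tables,
            "language": language,
        },
    }


def extract_data(pdf_path, api_token, base_url="http://localhost:5000"):
    """POST the PDF and instructions as multipart form data; return the parsed JSON."""
    # Third-party dependency (assumption); imported here so that
    # build_instructions stays usable without it.
    import requests

    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            f"{base_url}/api/build",
            headers={"Authorization": f"Token token={api_token}"},
            files={"document": pdf},
            data={"instructions": json.dumps(build_instructions())},
        )
    response.raise_for_status()
    return response.json()
```

Calling extract_data("/path/to/example-document.pdf", "<API token>") should return the same JSON that the curl command writes to result.json, provided Document Engine is reachable at the assumed base URL.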

Interpreting the Data Extraction Response

The API response contains, for each page, the data you enabled in the output parameters of the request, such as:

  • Plain text
  • Structured text with information about characters, lines, paragraphs, and words
  • Extracted key-value pairs
  • Tables
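
As a rough illustration of reading these results, the sketch below pulls the plain text and key-value pairs out of a parsed response. It assumes the field names shown in the example response in this guide. The helper names are hypothetical, and treating confidence as a 0 to 100 score is an assumption based only on the example value 95.4.

```python
import json


def plain_text_by_page(result):
    """Map each pageIndex to its plainText ("" if plain text was not requested)."""
    return {page["pageIndex"]: page.get("plainText", "")
            for page in result.get("pages", [])}


def key_value_pairs(result, min_confidence=0.0):
    """Yield (pageIndex, key, value, dataType) for pairs at or above min_confidence."""
    for page in result.get("pages", []):
        for pair in page.get("keyValuePairs", []):
            # Assumption: confidence is a score from 0 to 100.
            if pair.get("confidence", 0.0) >= min_confidence:
                yield (page["pageIndex"],
                       pair["key"]["content"],
                       pair["value"]["content"],
                       pair["value"].get("dataType"))


# A trimmed response in the same shape as the example response in this guide.
sample = json.loads("""
{
  "pages": [
    {
      "pageIndex": 0,
      "plainText": "Lorem ipsum\\n",
      "keyValuePairs": [
        {
          "confidence": 95.4,
          "key": {"content": "#"},
          "value": {"content": "\\u20ac", "dataType": "Currency"}
        }
      ]
    }
  ]
}
""")

print(plain_text_by_page(sample))
print(list(key_value_pairs(sample, min_confidence=90)))
```

To run this against a real response, replace the sample with the contents of result.json, for example json.load(open("result.json")).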

Example Data Extraction Response

{
  "pages": [
    {
      "pageIndex": 0,
      "plainText": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa.\n",
      "structuredText": {
        "characters": [
          {
            "bbox": {
              "left": 0,
              "top": 0,
              "width": 100,
              "height": 100
            },
            "value": "T"
          }
        ],
        "lines": [
          {
            "bbox": {
              "left": 0,
              "top": 0,
              "width": 100,
              "height": 100
            },
            "firstWordIndex": 0,
            "isRTL": false,
            "isVertical": false,
            "wordCount": 5
          }
        ],
        "paragraphs": [
          {
            "bbox": {
              "left": 0,
              "top": 0,
              "width": 100,
              "height": 100
            },
            "firstLineIndex": 0,
            "lineCount": 3
          }
        ],
        "words": [
          {
            "bbox": {
              "left": 0,
              "top": 0,
              "width": 100,
              "height": 100
            },
            "characterCount": 4,
            "firstCharacterIndex": 0,
            "isFromDictionary": true,
            "value": "word"
          }
        ]
      },
      "keyValuePairs": [
        {
          "confidence": 95.4,
          "key": {
            "bbox": {
              "left": 0,
              "top": 0,
              "width": 100,
              "height": 100
            },
            "content": "#"
          },
          "value": {
            "bbox": {
              "left": 0,
              "top": 0,
              "width": 100,
              "height": 100
            },
            "content": "€",
            "dataType": "Currency"
          }
        }
      ],
      "tables": [
        {
          "confidence": 95.4,
          "bbox": {
            "left": 0,
            "top": 0,
            "width": 100,
            "height": 100
          },
          "cells": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              },
              "rowIndex": 0,
              "columnIndex": 0,
              "isHeader": true,
              "text": "Invoice number"
            }
          ],
          "columns": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              }
            }
          ],
          "lines": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              },
              "isVertical": false,
              "thickness": 0
            }
          ],
          "rows": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              }
            }
          ]
        }
      ]
    }
  ]
}
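
The index fields in the response above suggest how the pieces fit together. The sketch below groups words into lines and arranges table cells into a grid. It rests on two assumptions inferred from the field names rather than confirmed by this guide: firstWordIndex and wordCount describe a consecutive slice of the page's words array, and rowIndex and columnIndex are zero-based. The function names are hypothetical.

```python
def words_per_line(structured_text):
    """Group word values by line, using firstWordIndex and wordCount."""
    words = structured_text.get("words", [])
    grouped = []
    for line in structured_text.get("lines", []):
        # Assumption: each line covers a consecutive run of the words array.
        start = line["firstWordIndex"]
        end = start + line["wordCount"]
        grouped.append([word["value"] for word in words[start:end]])
    return grouped


def table_to_grid(table):
    """Place each cell's text at grid[rowIndex][columnIndex]; gaps stay empty."""
    cells = table.get("cells", [])
    if not cells:
        return []
    # Assumption: rowIndex and columnIndex are zero-based.
    row_count = max(cell["rowIndex"] for cell in cells) + 1
    column_count = max(cell["columnIndex"] for cell in cells) + 1
    grid = [[""] * column_count for _ in range(row_count)]
    for cell in cells:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
    return grid


# Small hand-written inputs in the shape of the example response.
sample_text = {
    "words": [{"value": "Lorem"}, {"value": "ipsum"}, {"value": "dolor"}],
    "lines": [
        {"firstWordIndex": 0, "wordCount": 2},
        {"firstWordIndex": 2, "wordCount": 1},
    ],
}
sample_table = {
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "isHeader": True, "text": "Invoice number"},
        {"rowIndex": 0, "columnIndex": 1, "isHeader": True, "text": "Date"},
        {"rowIndex": 1, "columnIndex": 0, "isHeader": False, "text": "1001"},
    ]
}

print(words_per_line(sample_text))
print(table_to_grid(sample_table))
```

Cells missing from the response are left as empty strings in the grid, which keeps every row the same length when exporting to CSV or a spreadsheet.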