Extract tables from PDF documents easily

PSPDFKit Processor has been deprecated and replaced by Document Engine. To migrate to Document Engine and unlock advanced document processing capabilities, refer to our migration guide. Learn more about these enhancements on our blog.

This guide explains how to extract table information from PDF documents using Processor.

Before you get started, make sure Processor is up and running.

You can download and use either of the following sample documents for the examples in this guide:

You’ll be sending multipart POST requests with instructions to Processor’s /build endpoint. To learn more about multipart requests, refer to our blog post on the topic, A Brief Tour of Multipart Requests.

Check out the API Reference to learn more about the /build endpoint and all the actions you can perform on PDFs with PSPDFKit Processor.

Sending the Request to Extract Data

To extract table data from a document, post a multipart request to the /build API endpoint. In the instructions, specify the following output parameters:

type specifies the output type. Set this to json-content.
tables is a Boolean value that determines whether to extract table data.
language specifies the language used for recognizing text with optical character recognition (OCR). OCR is needed when a PDF or an image stores text in a form that cannot be searched or copied, as is typical for scanned pages. PSPDFKit’s OCR engine recognizes such text and saves it in a separate file in which you can search, copy, and paste it.

SHELL

curl -X POST http://localhost:5000/api/build \
  -F document=@/path/to/example-document.pdf \
  -F instructions='{
  "parts": [
    {
      "file": "document"
    }
  ],
  "output": {
    "type": "json-content",
    "tables": true,
    "language": "english"
  }
}' \
  -o result.json

HTTP

POST /api/build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary

--customboundary
Content-Disposition: form-data; name="document"; filename="example-document.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    {
      "file": "document"
    }
  ],
  "output": {
    "type": "json-content",
    "tables": true,
    "language": "english"
  }
}
--customboundary--

For more information on the /build instructions, refer to the API Reference.
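
The same request can also be built programmatically. The following is a minimal Python sketch using only the standard library, assuming Processor is reachable at http://localhost:5000 as in the examples above. The helper names (build_instructions, encode_multipart, extract_tables) are hypothetical and not part of any PSPDFKit client library.

```python
import json
import urllib.request
import uuid


def build_instructions(language="english"):
    """Return /build instructions that request table extraction as JSON."""
    return {
        "parts": [{"file": "document"}],
        "output": {"type": "json-content", "tables": True, "language": language},
    }


def encode_multipart(pdf_bytes, instructions, filename="example-document.pdf"):
    """Encode the PDF and the instructions as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode()
    body = b"\r\n".join([
        delimiter,
        b'Content-Disposition: form-data; name="document"; filename="'
        + filename.encode() + b'"',
        b"Content-Type: application/pdf",
        b"",
        pdf_bytes,
        delimiter,
        b'Content-Disposition: form-data; name="instructions"',
        b"Content-Type: application/json",
        b"",
        json.dumps(instructions).encode(),
        delimiter + b"--",
        b"",
    ])
    return body, "multipart/form-data; boundary=" + boundary


def extract_tables(pdf_path, url="http://localhost:5000/api/build"):
    """POST the document to the /build endpoint and return the parsed JSON."""
    with open(pdf_path, "rb") as f:
        body, content_type = encode_multipart(f.read(), build_instructions())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract_tables("/path/to/example-document.pdf") would send the same parts and output parameters as the curl command. If your deployment requires authentication or uses a different host or port, adjust the URL and headers accordingly.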

Example Data Extraction Response

{
  "pages": [
    {
      "pageIndex": 0,
      "tables": [
        {
          "confidence": 95.4,
          "bbox": {
            "left": 0,
            "top": 0,
            "width": 100,
            "height": 100
          },
          "cells": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              },
              "rowIndex": 0,
              "columnIndex": 0,
              "isHeader": true,
              "text": "Invoice number"
            }
          ],
          "columns": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              }
            }
          ],
          "lines": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              },
              "isVertical": false,
              "thickness": 0
            }
          ],
          "rows": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              }
            }
          ]
        }
      ]
    }
  ]
}
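
To work with a response like this, each table's cells can be arranged into rows using their rowIndex and columnIndex values. The following Python sketch shows one way to do that. The helper names are hypothetical, and the sketch assumes that confidence is reported on a 0 to 100 scale (as the 95.4 in the example suggests) and that every cell carries rowIndex, columnIndex, and text.

```python
def table_to_grid(table):
    """Arrange a table's cells into a list of rows (lists of cell text)."""
    cells = table.get("cells", [])
    if not cells:
        return []
    n_rows = max(cell["rowIndex"] for cell in cells) + 1
    n_cols = max(cell["columnIndex"] for cell in cells) + 1
    # Positions with no reported cell stay as empty strings.
    grid = [[""] * n_cols for _ in range(n_rows)]
    for cell in cells:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell.get("text", "")
    return grid


def iter_tables(response, min_confidence=0.0):
    """Yield (pageIndex, grid) for each table at or above min_confidence."""
    for page in response.get("pages", []):
        for table in page.get("tables", []):
            if table.get("confidence", 0.0) >= min_confidence:
                yield page["pageIndex"], table_to_grid(table)
```

For example, after loading the saved response with json.load, iterating over iter_tables(response, min_confidence=90) would skip tables the engine is less sure about. The threshold of 90 is illustrative, not a recommended value.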
