Extract Text from PDFs and Images

This guide explains how to extract text from PDF documents using Document Engine.

Sending the Request to Extract Data

To extract text from a document, send a multipart POST request to the /api/build endpoint. In the instructions, specify the following output parameters:

type specifies the output type. Set this to json-content.
plainText is a Boolean value that determines whether to extract data as plain text.
structuredText is a Boolean value that determines whether to extract data as structured text. Enabling this option gives you information about characters, lines, paragraphs, and words.
language specifies the language used for recognizing text with optical character recognition (OCR). Text is sometimes stored in a PDF or an image in a way that prevents you from searching or copying it. PSPDFKit’s OCR engine recognizes this text and saves it in a separate file where you can search, copy, and paste it.
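The parameters above can be assembled programmatically. The following is a minimal Python sketch, not part of Document Engine itself: the helper name build_instructions and its default values are hypothetical, while the JSON keys are the ones listed above.

```python
import json


def build_instructions(plain_text=True, structured_text=False, language="english"):
    """Return the instructions JSON for a json-content text extraction.

    Hypothetical helper: only the JSON keys come from this guide.
    """
    instructions = {
        # "document" is the name of the multipart part that carries the file.
        "parts": [{"file": "document"}],
        "output": {
            "type": "json-content",
            "plainText": plain_text,
            "structuredText": structured_text,
            "language": language,
        },
    }
    return json.dumps(instructions, indent=2)


# Request both plain and structured text for an English document.
print(build_instructions(structured_text=True))
```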

curl -X POST http://localhost:5000/api/build \
  -H "Authorization: Token token=<API token>" \
  -F document=@/path/to/example-document.pdf \
  -F instructions='{
  "parts": [
    {
      "file": "document"
    }
  ],
  "output": {
    "type": "json-content",
    "plainText": true,
    "structuredText": true,
    "language": "english"
  }
}' \
  -o result.json

POST /api/build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary
Authorization: Token token=<API token>

--customboundary
Content-Disposition: form-data; name="document"; filename="example-document.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    {
      "file": "document"
    }
  ],
  "output": {
    "type": "json-content",
    "plainText": true,
    "structuredText": true,
    "language": "english"
  }
}
--customboundary--

For more information on the Build instructions, refer to the API Reference.
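The same request can also be sent from a script. The sketch below uses only the Python standard library and mirrors the raw HTTP request shown above. The function names encode_multipart and extract_text are hypothetical, and the base URL and token format are taken from the curl example, so adjust them to your deployment.

```python
import json
import urllib.request
import uuid


def encode_multipart(pdf_bytes, filename, instructions, boundary):
    """Build a multipart/form-data body with "document" and "instructions" parts."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="instructions"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{json.dumps(instructions)}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return head + pdf_bytes + tail


def extract_text(pdf_path, api_token, instructions, base_url="http://localhost:5000"):
    """POST the document to /api/build and return the parsed JSON response."""
    boundary = uuid.uuid4().hex  # random boundary, unlikely to occur in the PDF data
    with open(pdf_path, "rb") as pdf_file:
        body = encode_multipart(pdf_file.read(), "example-document.pdf", instructions, boundary)
    request = urllib.request.Request(
        f"{base_url}/api/build",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token token={api_token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract_text with the path to a PDF, your API token, and the instructions object from the examples above returns the response as a Python dictionary. Error handling (for example, catching urllib.error.HTTPError) is left out to keep the sketch short.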

Example Data Extraction Response

{
  "pages": [
    {
      "pageIndex": 0,
      "plainText": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa.\n",
      "structuredText": {
        "characters": [
          {
            "bbox": {
              "left": 0,
              "top": 0,
              "width": 100,
              "height": 100
            },
            "value": "T"
          }
        ],
        "lines": [
          {
            "bbox": {
              "left": 0,
              "top": 0,
              "width": 100,
              "height": 100
            },
            "firstWordIndex": 0,
            "isRTL": false,
            "isVertical": false,
            "wordCount": 5
          }
        ],
        "paragraphs": [
          {
            "bbox": {
              "left": 0,
              "top": 0,
              "width": 100,
              "height": 100
            },
            "firstLineIndex": 0,
            "lineCount": 3
          }
        ],
        "words": [
          {
            "bbox": {
              "left": 0,
              "top": 0,
              "width": 100,
              "height": 100
            },
            "characterCount": 4,
            "firstCharacterIndex": 0,
            "isFromDictionary": true,
            "value": "word"
          }
        ]
      }
    }
  ]
}
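To illustrate how the response might be consumed, here is a short Python sketch that collects the plain text per page and groups words by line. The helper names are hypothetical, and the sketch assumes that firstWordIndex and wordCount refer to positions in the words array of the same page, which the response above suggests but this guide does not state explicitly.

```python
def plain_text_by_page(result):
    """Map each pageIndex to its plainText (empty string if the key is absent)."""
    return {page["pageIndex"]: page.get("plainText", "") for page in result["pages"]}


def words_of_line(structured_text, line):
    """Return the word values that belong to one line.

    Assumption: firstWordIndex and wordCount index into the "words"
    array of the same page.
    """
    start = line["firstWordIndex"]
    words = structured_text["words"][start:start + line["wordCount"]]
    return [word["value"] for word in words]


# Hand-written sample in the shape of the response above (bbox fields omitted).
sample = {
    "pages": [
        {
            "pageIndex": 0,
            "plainText": "Lorem ipsum\ndolor\n",
            "structuredText": {
                "lines": [
                    {"firstWordIndex": 0, "wordCount": 2},
                    {"firstWordIndex": 2, "wordCount": 1},
                ],
                "words": [
                    {"value": "Lorem"},
                    {"value": "ipsum"},
                    {"value": "dolor"},
                ],
            },
        }
    ]
}

structured = sample["pages"][0]["structuredText"]
lines = [words_of_line(structured, line) for line in structured["lines"]]
print(plain_text_by_page(sample))
print(lines)  # [['Lorem', 'ipsum'], ['dolor']]
```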
