JavaScript PDF Parser Library
Documents can contain a variety of different data in many formats: text, annotations, digital signatures, etc. With PSPDFKit for Web, you can parse that data separately and process it according to your needs.
PSPDFKit for Web’s API includes a variety of methods to enable access to different types of content from a document.
Page Information
It’s possible to retrieve basic information from a specific page, such as its dimensions, rotation, and label. A call to Instance#pageInfoForIndex returns that information in a PSPDFKit.PageInfo object:
const { width, height, index, label, rotation } = instance.pageInfoForIndex(0);
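To gather this information for every page, the same call can be repeated for each page index. The following is a minimal sketch, assuming that `instance` is a loaded PSPDFKit for Web instance exposing a `totalPageCount` property, and that `pageInfoForIndex` returns `null` when no information is available for an index. `collectPageSizes` is a hypothetical helper name, not part of the API.

```javascript
// Hypothetical helper: collects basic page information for every page.
// Assumes `instance.totalPageCount` holds the number of pages and that
// `pageInfoForIndex` returns null for an index it cannot resolve.
function collectPageSizes(instance) {
  const pages = [];
  for (let pageIndex = 0; pageIndex < instance.totalPageCount; pageIndex++) {
    const pageInfo = instance.pageInfoForIndex(pageIndex);
    if (!pageInfo) continue; // Skip pages without available information.
    const { width, height, label, rotation } = pageInfo;
    pages.push({ pageIndex, width, height, label, rotation });
  }
  return pages;
}

// Usage (with a loaded instance): const pages = collectPageSizes(instance);
```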
Page Text
Retrieving the text of a page can be done using Instance#textLinesForPageIndex, which returns a Promise resolving to a PSPDFKit.Immutable.List of PSPDFKit.TextLine. In turn, this can be traversed to parse the content of each line:
// Retrieve and log text lines for page 0.
const textLines = await instance.textLinesForPageIndex(0);
textLines.forEach((textLine, textLineIndex) => {
  console.log(`Content for text line ${textLineIndex}`);
  console.log(`Text: ${textLine.contents}`);
  console.log(`Id: ${textLine.id}`);
  console.log(`Page index: ${textLine.pageIndex}`);
  console.log(`Bounding box: ${JSON.stringify(textLine.boundingBox.toJS())}`);
});
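The same traversal can be extended to the whole document by looping over every page index. This is a sketch only: it assumes `instance.totalPageCount` is available, and `extractDocumentText` is a hypothetical helper name rather than an API method.

```javascript
// Hypothetical helper: returns an array with the text of each page,
// with the text lines of a page joined by newlines.
// Assumes `instance.totalPageCount` holds the number of pages.
async function extractDocumentText(instance) {
  const pageTexts = [];
  for (let pageIndex = 0; pageIndex < instance.totalPageCount; pageIndex++) {
    const textLines = await instance.textLinesForPageIndex(pageIndex);
    const lines = [];
    // `forEach` is available on both Immutable lists and plain arrays.
    textLines.forEach((textLine) => {
      lines.push(textLine.contents);
    });
    pageTexts.push(lines.join("\n"));
  }
  return pageTexts;
}

// Usage (with a loaded instance): const pageTexts = await extractDocumentText(instance);
```

Pages are processed one after another here to keep the order of the result obvious. For large documents, requesting several pages concurrently may be faster.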
Form Fields
It’s possible to retrieve detailed information about each form field in a document with Instance#getFormFields:
const formFields = await instance.getFormFields();
You can check each form field type’s properties in the corresponding API reference section.
Form Field Values
Similarly to form fields, form field values can be retrieved with Instance#getFormFieldValues:
const values = instance.getFormFieldValues();
The returned object includes each form field value indexed by the form field name.
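Because the result is keyed by form field name, it can be processed with standard object utilities. The sketch below lists the fields that currently hold a value. It assumes that empty fields are reported as `null`, `undefined`, an empty string, or an empty list. `filledFieldNames` is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the names of form fields that hold a value.
// Assumes empty fields appear as null, undefined, "" or an empty list.
function filledFieldNames(values) {
  return Object.entries(values)
    .filter(([, value]) => {
      if (value === null || value === undefined) return false;
      // Lists may expose `size` (Immutable lists) or `length` (arrays, strings).
      const length = value.size ?? value.length;
      return length === undefined || length > 0;
    })
    .map(([name]) => name);
}

// Usage (with a loaded instance):
// const filled = filledFieldNames(instance.getFormFieldValues());
```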
Annotation Text
Some annotation types can include text as one of their properties:
// Retrieve annotations from page 0.
const annotations = await instance.getAnnotations(0);
// Retrieves the first text annotation available.
const textAnnotation = annotations.find(
  annotation => annotation instanceof PSPDFKit.Annotations.TextAnnotation
);
// Logs the text of the text annotation.
console.log(textAnnotation.text);
Note annotations can also include text as one of their properties.
Text under an Annotation
Markup annotations can be used to highlight or draw attention to some text in the document. That text isn’t part of the annotation’s properties, but it can be obtained by mapping the annotation’s bounding box to the bounding boxes of the text lines of the page.
PSPDFKit for Web makes that operation easy by providing Instance#getMarkupAnnotationText and Instance#getTextFromRects:
// Retrieve annotations from page 0.
const annotations = await instance.getAnnotations(0);
// Retrieves the first highlight annotation available.
const highlightAnnotation = annotations.find(
  annotation => annotation instanceof PSPDFKit.Annotations.HighlightAnnotation
);
// Logs the text behind the highlight annotation.
console.log(await instance.getMarkupAnnotationText(highlightAnnotation));
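Instance#getTextFromRects can be used in a similar way when working with rectangles directly. The following sketch assumes that the method takes a page index and a list of rectangles and resolves to a string, and that a markup annotation exposes `pageIndex` and `rects` properties. `textUnderAnnotation` is a hypothetical helper name; check the API reference for the exact signature.

```javascript
// Hypothetical helper: reads the text covered by a markup annotation
// through its rectangles.
// Assumes `getTextFromRects(pageIndex, rects)` resolves to a string and
// that the annotation exposes `pageIndex` and `rects`.
async function textUnderAnnotation(instance, annotation) {
  if (!annotation) return null; // No matching annotation was found.
  return instance.getTextFromRects(annotation.pageIndex, annotation.rects);
}

// Usage (with a loaded instance and a highlight annotation):
// const text = await textUnderAnnotation(instance, highlightAnnotation);
```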
Bookmarks
Extracting bookmark information can be done with PSPDFKit for Web’s Instance#getBookmarks method:
const bookmarks = await instance.getBookmarks();
bookmarks.forEach(bookmark => {
  console.log(bookmark.toJS());
});
Digital Signatures
When your license includes the Digital Signatures component, you can extract digital signature information from any digitally signed document. This is done through Instance#getSignaturesInfo, which resolves to a PSPDFKit.SignaturesInfo record. This object includes:
- A signatures array with individual PSPDFKit.SignatureInfo data for each signature
- A status field with document validation information
const signaturesInfo = await instance.getSignaturesInfo();
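The two fields described above can be combined into a short summary. This is a minimal sketch that only relies on the `signatures` and `status` fields. It assumes `signatures` may be missing when a document carries no signatures. `summarizeSignatures` is a hypothetical helper name.

```javascript
// Hypothetical helper: summarizes the signature state of a document.
// Only uses the `status` and `signatures` fields of the resolved record;
// assumes `signatures` may be absent for unsigned documents.
async function summarizeSignatures(instance) {
  const signaturesInfo = await instance.getSignaturesInfo();
  const signatures = signaturesInfo.signatures || [];
  return {
    status: signaturesInfo.status,
    signatureCount: signatures.length,
  };
}

// Usage (with a loaded instance):
// const { status, signatureCount } = await summarizeSignatures(instance);
```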