Extract the text position from PDFs on Android

Nutrient Android SDK’s PdfDocument class can be used to get the position of text on a page.

getPageTextRects() takes a page index, a starting character index, and a length. It returns a list of RectF objects, one for each consecutive block of characters within that range on the page.

Here’s an example, first in Kotlin and then in Java, showing how to get the position of each text block on a page:

val pageIndex = 0
val pageText = document.getPageText(pageIndex)

for (textRect in document.getPageTextRects(pageIndex, 0, pageText.length)) {
    // Get the text for the block.
    val blockText = document.getPageText(pageIndex, textRect)
    // And the position/area of the block is `textRect`.
    ...
}

final int pageIndex = 0;
final String pageText = document.getPageText(pageIndex);

for (final RectF textRect : document.getPageTextRects(pageIndex, 0, pageText.length())) {
    // Get the text for the block.
    final String blockText = document.getPageText(pageIndex, textRect);
    // And the position/area of the block is `textRect`.
    ...
}
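When a range yields several rectangles, it can be handy to reduce them to one enclosing area. The following is a minimal sketch in plain Java, not part of the SDK: `TextBounds`, `Box`, and `enclose` are hypothetical names, and `Box` is a stand-in for `RectF` so the code runs without Android. It takes the minimum and maximum of both edges on each axis, so it doesn't assume whether `top` is larger or smaller than `bottom`.

```java
import java.util.List;

class TextBounds {
    /** Hypothetical stand-in for android.graphics.RectF so the sketch runs without Android. */
    static final class Box {
        final float left, top, right, bottom;

        Box(float left, float top, float right, float bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }
    }

    /** Returns {minX, minY, maxX, maxY} covering all boxes, or null for an empty list. */
    static float[] enclose(List<Box> boxes) {
        if (boxes.isEmpty()) {
            return null;
        }
        float minX = Float.POSITIVE_INFINITY;
        float minY = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY;
        float maxY = Float.NEGATIVE_INFINITY;
        for (Box b : boxes) {
            // Compare both edges on each axis so the y-axis direction doesn't matter.
            minX = Math.min(minX, Math.min(b.left, b.right));
            maxX = Math.max(maxX, Math.max(b.left, b.right));
            minY = Math.min(minY, Math.min(b.top, b.bottom));
            maxY = Math.max(maxY, Math.max(b.top, b.bottom));
        }
        return new float[] {minX, minY, maxX, maxY};
    }
}
```

In an app, each `Box` would be built from the `left`, `top`, `right`, and `bottom` fields of a `RectF` returned by getPageTextRects().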

For more granular character or word positions, pass the desired character offset and length to getPageTextRects(). For example, the following (Kotlin, then Java) gets the position of the first instance of the word PSPDFKit on the first page of a document:

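val pageIndex = 0
val textPosition = document.getPageText(pageIndex).indexOf("PSPDFKit")
// `indexOf()` returns -1 when the word isn't on the page.
val textRects = if (textPosition >= 0) document.getPageTextRects(pageIndex, textPosition, "PSPDFKit".length, true) else emptyList<RectF>()

final int pageIndex = 0;
final int textPosition = document.getPageText(pageIndex).indexOf("PSPDFKit");
// `indexOf()` returns -1 when the word isn't on the page.
final List<RectF> textRects = textPosition >= 0
    ? document.getPageTextRects(pageIndex, textPosition, "PSPDFKit".length(), true)
    : Collections.emptyList();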
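
The lookup above finds only the first instance. To locate every instance, one option is to collect all start offsets first and then call getPageTextRects() once per offset. Here's a minimal sketch in plain Java with no SDK dependency. `TextOccurrences` and `findAll` are hypothetical names, and the sample string stands in for the result of getPageText().

```java
import java.util.ArrayList;
import java.util.List;

class TextOccurrences {
    /** Returns the start offset of every non-overlapping occurrence of word in pageText. */
    static List<Integer> findAll(String pageText, String word) {
        List<Integer> offsets = new ArrayList<>();
        if (word.isEmpty()) {
            return offsets; // An empty search term has no meaningful position.
        }
        int from = 0;
        while (true) {
            int index = pageText.indexOf(word, from);
            if (index < 0) {
                break; // No further occurrences.
            }
            offsets.add(index);
            from = index + word.length(); // Continue after this match.
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Stand-in for document.getPageText(pageIndex).
        String pageText = "PSPDFKit docs: PSPDFKit for Android";
        for (int offset : findAll(pageText, "PSPDFKit")) {
            // In an app, this is where you'd call:
            // document.getPageTextRects(pageIndex, offset, "PSPDFKit".length(), true)
            System.out.println(offset); // prints 0, then 15
        }
    }
}
```

Matches are treated as non-overlapping, which suits whole words. An empty result also covers the case where the word isn't on the page, so no separate -1 check is needed.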