Parse PDF content on Android
Parsing text and other content from a PDF can be a complex task, so we offer several abstractions to make this simpler. In a PDF, text usually consists of glyphs positioned at absolute coordinates, with no explicit association between neighboring glyphs. Nutrient heuristically groups these glyphs into words and blocks of text. Our user interface leverages this information to allow users to select and annotate text. You can read more about this in our text selection guide.
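To make the idea of grouping glyphs more concrete, here is a minimal sketch of one possible heuristic: glyphs on a single line are joined into a word until the horizontal gap to the previous glyph exceeds a threshold. This is an illustration only and not Nutrient's actual algorithm. The `Glyph` class, the `groupIntoWords` method, and the `maxGap` parameter are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class GlyphGrouping {
    // Hypothetical glyph model: a character plus its absolute horizontal extent on the page.
    static final class Glyph {
        final char ch;
        final double x;      // left edge, in page units
        final double width;  // advance width, in page units

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyphs (assumed to be on one line and sorted left to right) into words.
    // A new word starts whenever the gap to the previous glyph exceeds maxGap.
    static List<String> groupIntoWords(List<Glyph> glyphs, double maxGap) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousEnd = 0;
        for (Glyph glyph : glyphs) {
            if (current.length() > 0 && glyph.x - previousEnd > maxGap) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(glyph.ch);
            previousEnd = glyph.x + glyph.width;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up coordinates: "Hi" and "you" are separated by a wide gap.
        List<Glyph> glyphs = List.of(
            new Glyph('H', 10.0, 6.0),
            new Glyph('i', 16.5, 3.0),
            new Glyph('y', 25.0, 5.0),
            new Glyph('o', 30.2, 5.0),
            new Glyph('u', 35.5, 5.0));
        System.out.println(groupIntoWords(glyphs, 2.0));
    }
}
```

A real implementation also has to deal with multiple lines, rotated text, varying font sizes, and right-to-left scripts, which is why the grouping is described as heuristic.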
Text parsing
PdfDocument offers methods that allow you to access text from a given PDF page. The text parsing API ensures that the text extracted from a PDF follows any logical structure defined in the PDF (see section 14.7 of the PDF specification), which enables correct handling of different text-writing directions. Using PdfDocument#getPageText(), as shown below, you can get all the text found on a single page:
val document: PdfDocument = ...
val pageText = document.getPageText(0)
PdfDocument document = ...
String pageText = document.getPageText(0);
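Building on the single-page call above, the following sketch collects the text of every page by looping over page indexes. To keep it runnable without the SDK, it uses a hypothetical `PageTextSource` interface as a stand-in for the document. The `getPageCount()` method and the zero-based page indexing are assumptions here (the `getPageText(0)` call above suggests the latter), so check both against the PdfDocument API reference before relying on them.

```java
import java.util.ArrayList;
import java.util.List;

public class AllPagesText {
    // Hypothetical stand-in for the document so this sketch runs without the SDK.
    // getPageText(int) mirrors the call shown above; getPageCount() is an assumption.
    interface PageTextSource {
        int getPageCount();
        String getPageText(int pageIndex);
    }

    // Returns one string per page, in page order (indexes assumed to start at 0).
    static List<String> collectPageTexts(PageTextSource document) {
        List<String> texts = new ArrayList<>();
        for (int pageIndex = 0; pageIndex < document.getPageCount(); pageIndex++) {
            texts.add(document.getPageText(pageIndex));
        }
        return texts;
    }

    public static void main(String[] args) {
        // Fake two-page document used only to exercise the loop.
        final List<String> fakePages = List.of("First page", "Second page");
        PageTextSource document = new PageTextSource() {
            @Override
            public int getPageCount() {
                return fakePages.size();
            }

            @Override
            public String getPageText(int pageIndex) {
                return fakePages.get(pageIndex);
            }
        };
        System.out.println(collectPageTexts(document));
    }
}
```

In an app, you would pass the real document instead of the fake one. For large documents, it is probably best to run such a loop off the main thread.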