Simplify PDF text extraction on Windows

Extracting text from a PDF can be a complex task, so we offer several abstractions to make this simpler. In a PDF, text usually consists of glyphs that are absolutely positioned. Nutrient heuristically splits these glyphs up into words and blocks of text. Our PdfView leverages this information to allow users to select and annotate text.

Text parser

TextParser offers a simple API to get the text, Glyphs, Words, and TextBlocks from a given PDF page. Every page of a PDF has its own text parser, which returns information about the text on that page:

// Get the text parser for the page at index 0.
var textParser = await doc.GetTextParserAsync(0);
// `rects` holds the page regions to read text from; define it beforehand.
var text = await textParser.GetTextForRectsAsync(rects);
var glyphs = await textParser.GetGlyphsAsync();
var words = TextParser.WordsFromGlyphs(glyphs);
var textBlocks = await textParser.GetTextAsync();

TextParser also ensures that the text it extracts from a PDF follows any logical structure defined in the PDF (see section 14.7 of the PDF specification). This lets it handle different text-writing directions correctly.

Glyphs

Glyph is the building block for all text extraction in Nutrient. It represents a single glyph on a PDF page. Its Rect property specifies, in PDF coordinates, where it’s located on the page, and its Contents property returns the text it contains. The Index property specifies the index of the glyph on the page in reading order. Consider a page with the following text:

The quick brown fox jumps over the lazy dog.
--------------------------^

The Glyph that represents the o in over has an Index of 26, counting from zero and including spaces. This index is unique to the glyph, and you can use it to access the glyph directly in the list of Glyphs returned from a TextParser:

var textParser = await doc.GetTextParserAsync(0);
var glyphs = await textParser.GetGlyphsAsync();

// Guaranteed to be `true`.
var indexEqualTo26 = glyphs[26].Index == 26;

This makes getting a particular glyph (and glyphs near it) much faster, as you already know the index. Ordering glyphs correctly is important if, for example, you wish to combine multiple glyphs and display something to the user.
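
As a minimal sketch of why a known index helps, the following C# collects a glyph and its neighbors in reading order. It uses a hypothetical stand-in type, GlyphInfo, that carries only the Index and Contents properties described above. It is not the SDK's Glyph class, and GlyphContext is an illustrative helper rather than an SDK API. The sketch assumes, as in the example sentence, that every character, including spaces, is one glyph.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a glyph, reduced to the two properties this
// sketch needs. It is not the SDK's Glyph class.
public sealed record GlyphInfo(int Index, string Contents);

public static class GlyphContext
{
    // Returns the text of the glyph at `index` plus up to `radius` glyphs
    // on either side, concatenated in reading order.
    public static string Around(IReadOnlyList<GlyphInfo> glyphs, int index, int radius)
    {
        if (index < 0 || index >= glyphs.Count)
            throw new ArgumentOutOfRangeException(nameof(index));

        // Because glyphs[i].Index == i, neighbors sit at adjacent list
        // positions, so no search is needed.
        int start = Math.Max(0, index - radius);
        int end = Math.Min(glyphs.Count - 1, index + radius);

        return string.Concat(
            glyphs.Skip(start).Take(end - start + 1).Select(g => g.Contents));
    }

    // Builds one stand-in glyph per character, for demonstration only.
    public static List<GlyphInfo> FromText(string text) =>
        text.Select((c, i) => new GlyphInfo(i, c.ToString())).ToList();
}
```

With glyphs built from the sample sentence through GlyphContext.FromText, calling GlyphContext.Around(glyphs, 26, 3) returns "ps over": the o at index 26 together with three glyphs on each side.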

Text blocks

A TextBlock returned from the TextParser represents a contiguous group of glyphs, usually in a single line. For PDFs with multiple columns of text, a text block is a single line in a given column.

Words

A Word, as the name suggests, represents a single word in a PDF. TextParser can calculate these words from glyphs using the WordsFromGlyphs method:

var textParser = await doc.GetTextParserAsync(0);
var glyphs = await textParser.GetGlyphsAsync();
var words = TextParser.WordsFromGlyphs(glyphs);

A Word has a Frame that describes the area the word covers on the page. It also has a Range that identifies which of the provided glyphs make up the word, and a Contents property that holds the text the Word represents.
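
To illustrate how a word's Range relates back to the glyph list, here is a minimal, self-contained C# sketch. GlyphCell, WordSpan, and NaiveWords are hypothetical stand-ins, not SDK types, and modeling the range as Start and Length is an assumption about its shape. The whitespace-based splitting is deliberately naive and is not the heuristic Nutrient uses.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-ins, not SDK types. WordSpan models only a range
// (as Start and Length, an assumed shape) and the word's Contents.
public sealed record GlyphCell(int Index, string Contents);
public sealed record WordSpan(int Start, int Length, string Contents);

public static class NaiveWords
{
    // Splits glyphs into words at whitespace glyphs. This naive rule is
    // for illustration only; it is not the SDK's word-detection heuristic.
    public static List<WordSpan> Split(IReadOnlyList<GlyphCell> glyphs)
    {
        var words = new List<WordSpan>();
        var current = new StringBuilder();
        int start = 0;

        // Iterate one step past the end so the final word is flushed.
        for (int i = 0; i <= glyphs.Count; i++)
        {
            bool isBreak = i == glyphs.Count
                || string.IsNullOrWhiteSpace(glyphs[i].Contents);

            if (!isBreak)
            {
                if (current.Length == 0) start = i; // first glyph of a new word
                current.Append(glyphs[i].Contents);
            }
            else if (current.Length > 0)
            {
                // The range covers glyph positions start .. i - 1.
                words.Add(new WordSpan(start, i - start, current.ToString()));
                current.Clear();
            }
        }
        return words;
    }

    // Rebuilds a word's text from the glyphs its range points at.
    public static string TextOf(IReadOnlyList<GlyphCell> glyphs, WordSpan word) =>
        string.Concat(glyphs.Skip(word.Start).Take(word.Length).Select(g => g.Contents));
}
```

For every word this sketch produces, NaiveWords.TextOf(glyphs, word) equals word.Contents, which is the relationship between a range and the glyphs it covers that the paragraph above describes.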
