Extract text from PDFs in Windows

This guide shows how to extract the full text content from a single page or a whole document.

For more granular control over text extraction, refer to our parsing guide, which outlines the available text APIs in greater detail.

Page text

The TextParser API offers a simple way to get the text from a given PDF page:

var textParser = await PDFView.Document.GetTextParserAsync(0);
var textBlocks = await textParser.GetTextAsync();

Note that the GetTextAsync method returns a list of TextBlocks. Each block contains the text found in a specific line, that is, the continuous group of glyphs on that line.

The list of text lines returned by GetTextAsync can then be combined into a single string:

// Requires: using System.Text;
// textBlocks is the list returned by GetTextAsync for the page above.
var unifiedText = new StringBuilder();

foreach (var textBlock in textBlocks)
{
    // Append each line, separated by a single space.
    unifiedText.Append(textBlock.Contents);
    unifiedText.Append(" ");
}

var pageText = unifiedText.ToString();

The exact approach depends on your use case and document formatting, but this gives an idea of how to structure your TextParser usage. For a more in-depth look at the parser and how it interacts with glyphs, words, and text blocks, see the parsing guide.
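
As one illustration of how the joining step might vary with formatting, the sketch below joins lines with a newline instead of a space and skips lines that contain only whitespace. TextJoiner and JoinLines are hypothetical names, not part of the SDK, and the helper works on plain strings so it makes no assumptions about the TextBlock type beyond the Contents property shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TextJoiner
{
    // Hypothetical helper (not part of the SDK): joins line contents with a
    // separator, trimming each line and skipping whitespace-only lines.
    public static string JoinLines(IEnumerable<string> lines, string separator = "\n")
    {
        if (lines == null) throw new ArgumentNullException(nameof(lines));

        var cleaned = lines
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => line.Trim());

        return string.Join(separator, cleaned);
    }
}
```

Assuming textBlocks is the list returned by GetTextAsync, a call could look like TextJoiner.JoinLines(textBlocks.Select(b => b.Contents)). Whether a newline or a space is the better separator depends on whether line breaks in the PDF carry meaning for your use case.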

Document text

Each page has its own TextParser, so extracting the text of a whole document means repeating the page-level approach for every page. Keep in mind that parsing can be performance intensive, especially for larger documents:

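// Requires: using System.Collections.Generic; using System.Text;
var pageCount = await PDFView.Document.GetTotalPageCountAsync();
var documentTextBlocks = new List<TextBlock>();

// Collect the text blocks (lines) of every page in document order.
for (var i = 0; i < pageCount; i++)
{
    var textParser = await PDFView.Document.GetTextParserAsync(i);
    documentTextBlocks.AddRange(await textParser.GetTextAsync());
}

// A StringBuilder avoids repeated string reallocation on large documents.
var unifiedText = new StringBuilder();

foreach (var textBlock in documentTextBlocks)
{
    unifiedText.Append(textBlock.Contents);
    unifiedText.Append(" ");
}

var documentText = unifiedText.ToString();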
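
Because parsing a large document can take a while, it may be worth making the loop cancellable and reporting progress. The following is a minimal sketch under stated assumptions, not an SDK feature: DocumentTextExtractor, ExtractAsync, and the getPageLinesAsync delegate are hypothetical names, and the delegate is expected to return the line contents of a single page.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class DocumentTextExtractor
{
    // Hypothetical helper (not part of the SDK). getPageLinesAsync is a
    // caller-supplied delegate that returns the line contents of one page.
    public static async Task<string> ExtractAsync(
        int pageCount,
        Func<int, Task<IReadOnlyList<string>>> getPageLinesAsync,
        IProgress<int> progress = null,
        CancellationToken cancellationToken = default)
    {
        var text = new StringBuilder();

        for (var i = 0; i < pageCount; i++)
        {
            // Stop between pages if the caller requested cancellation.
            cancellationToken.ThrowIfCancellationRequested();

            foreach (var line in await getPageLinesAsync(i))
            {
                text.Append(line).Append(' ');
            }

            // Report the number of pages processed so far.
            progress?.Report(i + 1);
        }

        return text.ToString();
    }
}
```

In an app, the delegate would typically wrap the GetTextParserAsync and GetTextAsync calls shown above and return each block's Contents. The page count is passed as an int here, so a conversion may be needed depending on the type GetTotalPageCountAsync returns. Cancellation is only checked between pages, which keeps the sketch simple but means a single very large page still runs to completion.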