PDF Text Extraction in Swift

Nishant Desai

May 20, 2019

PDF is arguably the most important file format, and it is used extensively everywhere. For example, your credit card bills and phone bills are usually sent to you in PDF format, and restaurants will post their menus online this way. Tickets for movies or concerts come as PDF attachments, while students may read books or receive class material in the form of PDFs. Businesses use PDFs to share important information and contracts, and PDF has even become the de facto standard for resumes.

All these documents have one thing in common: they contain a lot of information in the form of text. Naturally, the ability to extract that text comes in handy, especially when you want to copy and paste information or excerpts into another app, share it via messages, etc.

Extraction

PDFKit offers a very convenient way to extract the entire text from a page. You simply need a PDFPage object for the page you want to extract text from. It has two properties you can use to do this: the string property gives back plain text, while the attributedString property gives back a string with all its attributes (NSAttributedString).
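To make the page-by-page extraction described above concrete, here is a minimal sketch that collects the plain text of every page in a document. The helper name `extractText(from:)` and the file path in the usage comment are hypothetical, and joining pages with a blank line is just one possible choice.

```swift
import Foundation
import PDFKit

/// Collects the plain text of every page in `document`, separating pages with a blank line.
/// `extractText(from:)` is a hypothetical helper, not part of PDFKit.
func extractText(from document: PDFDocument) -> String {
    var pageTexts: [String] = []
    for index in 0..<document.pageCount {
        // `page(at:)` returns an optional page, and `string` is an optional `String`.
        // Pages without any extractable text are skipped.
        guard let page = document.page(at: index),
              let text = page.string,
              !text.isEmpty else { continue }
        pageTexts.append(text)
    }
    return pageTexts.joined(separator: "\n\n")
}

// Usage (hypothetical path). `PDFDocument(url:)` returns nil if the file can't be opened.
// let url = URL(fileURLWithPath: "/path/to/document.pdf")
// if let document = PDFDocument(url: url) {
//     print(extractText(from: document))
// }
```

If you need fonts and colors as well, the same loop should work with `attributedString` in place of `string`.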

PDFPage allows you to retrieve information about individual characters as well. If you want to find a character at a given point on the page, you would do something like this:

// `page` is a `PDFPage` object. Its `string` property returns all the text on the page as an optional `String`.
let pageText = page.string

// `point` is the point (`CGPoint`) on the page you wish to find a character at.
let charIndex = page.characterIndex(at: point)

// Returns the bounds (`CGRect`) of the character at that index, in page space.
let characterBounds = page.characterBounds(at: charIndex)
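The snippet above assumes there actually is a character under the point. A slightly more defensive sketch might look like the following. The helper `characterInfo(at:on:)` is hypothetical, and the code assumes the character index counts UTF-16 units (as `NSString` does), so it validates the index against the page text before using it.

```swift
import Foundation
import PDFKit

/// Returns the character at `point` (in page space) together with its bounds,
/// or nil if the page has no text or no character lies under the point.
/// `characterInfo(at:on:)` is a hypothetical helper, not part of PDFKit.
func characterInfo(at point: CGPoint, on page: PDFPage) -> (text: String, bounds: CGRect)? {
    guard let pageText = page.string, !pageText.isEmpty else { return nil }
    let nsText = pageText as NSString

    let index = page.characterIndex(at: point)
    // Reject any "not found" value by checking the index against the text length.
    guard index >= 0, index < nsText.length else { return nil }

    // Assumption: `index` is a UTF-16 offset. Expand it to a full composed character
    // so that emoji and other multi-unit characters are not cut in half.
    let range = nsText.rangeOfComposedCharacterSequence(at: index)
    return (nsText.substring(with: range), page.characterBounds(at: index))
}
```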

The API above is useful when you’re working with individual character indexes, but it’s less helpful when you want to deal with a range of characters. Combined with another PDFKit class, PDFSelection, it becomes much easier to manage a selection that spans a range of characters.

There are a few different ways to work with PDFSelection. It encapsulates the result of the selection based on the coordinates provided. It can encapsulate the selection for a word, an entire line of text, or just the text at specified coordinates. You can also use NSRange to extract text using PDFPage and PDFSelection. This can come in handy when you’d prefer not to deal with coordinates:

let selectionPoint = CGPoint(x: 100, y: 100)

// Returns the selection for only the word at the given point.
let wordSelection = page.selectionForWord(at: selectionPoint)

// Returns the `PDFSelection` for the entire line based on the point in the coordinate space provided.
let lineSelection = page.selectionForLine(at: selectionPoint)

let anotherSelectionPoint = CGPoint(x: 180, y: 240)

// Will create a selection using only the characters occurring between the points given.
let precisionSelection = page.selection(from: selectionPoint, to: anotherSelectionPoint)

// Creates a selection for the given range of characters.
let rangeSelection = page.selection(for: NSRange(location: 3, length: 42))
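Each of the calls above returns an optional PDFSelection, and a hardcoded range like the one in the last line may extend past the end of a short page. As a rough sketch, you could clamp the range to the page’s character count before selecting and then read the text back from the selection. The helper `text(in:on:)` is hypothetical.

```swift
import Foundation
import PDFKit

/// Returns the text covered by `range` on `page`, clamping the range to the
/// characters the page actually has. Returns nil if nothing can be selected.
/// `text(in:on:)` is a hypothetical helper, not part of PDFKit.
func text(in range: NSRange, on page: PDFPage) -> String? {
    let pageRange = NSRange(location: 0, length: page.numberOfCharacters)
    // `intersection(_:)` keeps only the part of `range` that lies on the page.
    guard let clamped = range.intersection(pageRange), clamped.length > 0 else {
        return nil
    }
    // `selection(for:)` returns an optional selection, and `string` is optional too.
    return page.selection(for: clamped)?.string
}

// Usage: the other selections can be read the same way.
// let word = page.selectionForWord(at: CGPoint(x: 100, y: 100))?.string
// let excerpt = text(in: NSRange(location: 3, length: 42), on: page)
```

If you also need to know where the selected text sits on the page, PDFSelection’s `bounds(for:)` returns its bounding rectangle for a given page.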

If you’re not using PDFKit, you can instead use CGPDFScanner along with other Core Graphics APIs to parse a document. This requires in-depth knowledge about the structure of PDF documents. Extracting text (and other information) using Core Graphics is beyond the scope of this post, but Apple’s documentation on PDF document parsing is a good starting point.

Conclusion

There are two system APIs, PDFKit and Core Graphics, that can be used to extract text from a PDF. As seen above, PDFKit has convenient APIs for working with text that are much easier and far less error-prone than using CGPDFScanner, since PDFKit doesn’t require intricate knowledge of the structure of PDF documents. However, the point-based APIs expect coordinates in the page’s coordinate space, so if your points come from somewhere else (a tap in a view, for example), you have to take care of the coordinate system conversion yourself.
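As a rough illustration of that conversion, the sketch below assumes the document is displayed in a PDFView and that the point comes from the view’s own coordinate space (for example, from a gesture recognizer). The helper `word(at:in:)` is hypothetical.

```swift
import PDFKit

/// Returns the word under `viewPoint`, where `viewPoint` is given in the
/// coordinate space of `pdfView` rather than in page space.
/// `word(at:in:)` is a hypothetical helper, not part of PDFKit.
func word(at viewPoint: CGPoint, in pdfView: PDFView) -> String? {
    // Find the page under the point. With `nearest: false`, this is nil outside any page.
    guard let page = pdfView.page(for: viewPoint, nearest: false) else { return nil }
    // Convert from view space to the page space that `PDFPage` expects.
    let pagePoint = pdfView.convert(viewPoint, to: page)
    return page.selectionForWord(at: pagePoint)?.string
}
```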

PSPDFKit for iOS handles all of this automatically when extracting text, while also giving you the flexibility to transform coordinates as required. Additionally, PSPDFKit has an Indexed Full-Text Search component, which can be used to perform extremely quick searches across a large set of documents without you having to deal with text extraction, index and database management, or input sanitization.