Custom Tokenizers for PDF Search on iOS
PSPDFKit uses SQLite to build the full-text index used in PDFLibrary and PDFDocumentPickerController, and also for various other data-saving operations (like the image cache metadata). PSPDFKit doesn’t ship with its own SQLite version, and instead it uses the one that’s already in iOS. PSPDFKit also supports custom SQLite builds.
By default, PDFLibrary uses its own tokenizer, which works well for many languages, including Chinese, Japanese, and Korean (CJK). It also enables searching for related words, e.g. finding “dependencies” when searching for “depending.” This tokenizer is referenced by the PSPDFLibraryPorterTokenizerName identifier.
When should you ship your own build of SQLite?
- When you want better indexing performance
- When you need features only available in a newer version of SQLite
- When you need better performance for exact word or phrase matches
If you rely a lot on exact word or phrase matches, the default tokenizer set by PDFLibrary might not be optimal and you should consider switching to a custom one.
By default, PSPDFKit uses a custom tokenizer for building the full-text search (FTS) index that can deal with CJK characters as well. Alternatively, we ship another custom tokenizer, which is referenced by the PDFLibrary.UnicodeTokenizerName identifier. This tokenizer is a wrapper around SQLite’s unicode61 tokenizer, but it performs full case folding. This is useful in cases where the document being indexed has text like Straße and you’d like it to match when searching for strasse.
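To see why full case folding matters for the Straße example, here is a minimal sketch in plain Swift. It doesn’t use the tokenizer or any PSPDFKit API. The helper approximateFullCaseFold is hypothetical and only approximates case folding by uppercasing and then lowercasing, which expands ß to ss; the actual tokenizer may work differently.

```swift
/// Hypothetical helper (not part of PSPDFKit or SQLite): approximates full
/// case folding by uppercasing first (which expands "ß" to "SS") and then
/// lowercasing the result.
func approximateFullCaseFold(_ text: String) -> String {
    return text.uppercased().lowercased()
}

let indexedText = "Straße"
let query = "strasse"

// Simple lowercasing keeps "ß" intact, so the two strings still differ.
let simpleMatch = indexedText.lowercased() == query.lowercased()

// After the approximate fold, both sides use "ss" and compare equal.
let foldedMatch = approximateFullCaseFold(indexedText) == approximateFullCaseFold(query)

print("simple lowercase match: \(simpleMatch), folded match: \(foldedMatch)")
```

A tokenizer that only lowercases its input would index and look up different terms for these two spellings, which is the gap full case folding closes.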
You can also use the custom tokenizers shipped with SQLite itself, like the unicode61 or icu tokenizers.
Tokenizer | Minimum FTS Version | Minimum SQLite Version |
---|---|---|
PSPDFLibraryPorterTokenizerName | FTS4 | 3.7.4 |
PDFLibrary.UnicodeTokenizerName | FTS5 | 3.9.0 |
unicode61 | FTS4 | 3.7.13 |
Note that simply linking the correct SQLite version with your application isn’t enough: You must ensure that the linked SQLite is built with the correct flags to enable FTS4 or FTS5. Trying to enable a tokenizer on an unsupported FTS version will result in the initialization of PDFLibrary failing:
```swift
do {
    let library = try PDFLibrary(path: PDFLibrary.defaultLibraryPath(), tokenizer: "unicode61")
    let documentPicker = PDFDocumentPickerController(directory: "/path/to/files", includeSubdirectories: true, library: library)
} catch {
    // Handle error.
}
```

```objc
PSPDFLibrary *library = [PSPDFLibrary libraryWithPath:PSPDFLibrary.defaultLibraryPath tokenizer:@"unicode61" error:NULL];
PSPDFDocumentPickerController *documentPicker = [[PSPDFDocumentPickerController alloc] initWithDirectory:@"/path/to/files" includeSubdirectories:YES library:library];
```
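If you want to check at runtime which SQLite version your app is linked against and whether it reports FTS support, the SQLite C API can be queried directly. The following is a minimal sketch, not something PSPDFKit requires. The helper sqliteWasBuilt(with:) is a hypothetical name, and the sketch assumes the linked SQLite wasn’t built with SQLITE_OMIT_COMPILEOPTION_DIAGS, which removes sqlite3_compileoption_used. It treats either ENABLE_FTS3 or ENABLE_FTS4 as enabling FTS4, which is our reading of the SQLite documentation.

```swift
import SQLite3

// Version string of the SQLite library the process is actually linked against.
let sqliteVersion = String(cString: sqlite3_libversion())

/// Hypothetical helper (not part of PSPDFKit): returns true if the linked
/// SQLite reports the given compile-time option. The "SQLITE_" prefix is
/// optional for sqlite3_compileoption_used.
func sqliteWasBuilt(with option: String) -> Bool {
    return sqlite3_compileoption_used(option) != 0
}

// Assumption: either flag makes FTS4 available.
let hasFTS4 = sqliteWasBuilt(with: "ENABLE_FTS3") || sqliteWasBuilt(with: "ENABLE_FTS4")
let hasFTS5 = sqliteWasBuilt(with: "ENABLE_FTS5")

print("SQLite \(sqliteVersion): FTS4 enabled: \(hasFTS4), FTS5 enabled: \(hasFTS5)")
```

Running a check like this in a debug build before creating the library can help distinguish a missing build flag from other initialization errors.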
Optionally, you can also ship your own version of SQLite. In the PSPDFKit.dmg you downloaded, you’ll find a current version of SQLite in the Extras folder that’s already prepared to be linked. Add SQLite.xcodeproj to your Xcode project, and then add libSQLite.a as a Target Dependency and under Link Binary with Libraries. Make sure you don’t link the libsqlite3.tbd library.
ℹ️ Note: You’ll have to delete your app, or at least the library file, so that the index is fully rebuilt after a different tokenizer has been set.
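Deleting the library file can be done with FileManager. This is a minimal sketch: removeLibraryFile(atPath:) is a hypothetical helper, not a PSPDFKit API, and it assumes you call it before any PDFLibrary instance is created for that path.

```swift
import Foundation

/// Hypothetical helper (not part of PSPDFKit): deletes the file at `path`
/// if it exists. Returns true if a file was removed, false if there was
/// nothing to remove. Throws if the removal itself fails.
func removeLibraryFile(atPath path: String) throws -> Bool {
    let fileManager = FileManager.default
    guard fileManager.fileExists(atPath: path) else {
        return false
    }
    try fileManager.removeItem(atPath: path)
    return true
}
```

Assuming PDFLibrary.defaultLibraryPath() returns the path as a string, as in the example above, you’d call try removeLibraryFile(atPath: PDFLibrary.defaultLibraryPath()) once after changing the tokenizer, and then create the library as usual so the index is rebuilt.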