Indexed full-text PDF search in UWP

Nutrient supports fast and efficient full-text search in PDF documents through PSPDFKit.Search.Library. This document describes how to get started.

Getting started

To start indexing, create a Library and give it a name. You can then add folders that contain PDF files to this named library. The Library will index all the PDFs in those folders.

Here’s a simple example of how to create or open a library and start indexing PDFs in a directory:

// Opening a library creates one if it doesn't already exist.
var library = await Library.OpenLibraryAsync("MyLibrary");

// Find a folder containing PDFs.
var folderPicker = new Windows.Storage.Pickers.FolderPicker();
folderPicker.SuggestedStartLocation = Windows.Storage.Pickers.PickerLocationId.Desktop;
folderPicker.FileTypeFilter.Add("*");

Windows.Storage.StorageFolder folder = await folderPicker.PickSingleFolderAsync();
if (folder != null)
{
  // Queue up the PDFs in the folder for indexing.
  await library.EnqueueDocumentsInFolderAsync(folder);
}

The documents will now be indexed in the background.

Alternatively, you can enqueue a List of IDataProvider objects with the EnqueueDocumentsFromProviderAsync method.

Then, you can either start querying documents right away or wait until all documents added to the indexer queue have been indexed.

Here’s an example of how to wait and then get the list of indexed documents:

// Wait for indexing to finish.
await library.WaitForAllIndexingTasksToFinishAsync();

// Get the list of indexed documents.
var documentUIDs = await library.GetIndexedUidsAsync();

Identifying documents

Each document in the list returned by GetIndexedUidsAsync is represented by a unique ID (UID). When using StorageFiles, this UID is a string composed of a future access token identifying the folder containing the PDF and the file name of the PDF within that folder. For IDataProviders, the indexed UID is a string containing only the IDataProvider UID.

Due to the unique restrictions of UWP, when using StorageFiles, it’s essential that you don’t clear the application’s FutureAccessList if you wish to retain your libraries, as this is the only place for the future access token to be recorded.

Moreover, when using DataProviders, neither the streams nor providers themselves are tracked internally, and they need to be managed by your own application.

You can create a PSPDFKit.Document.DocumentSource object for a given document UID using either DocumentSource.CreateFromStorageFileUidAsync or DocumentSource.CreateFromDataProvider, both of which are static methods.

A StorageFile object for the file can be accessed by calling GetFile on the created DocumentSource object. Note that the method will throw an exception if the document referred to can no longer be located.

Here’s an example:

// Get the list of indexed documents.
var documentUIDs = await library.GetIndexedUidsAsync();

foreach (var uid in documentUIDs)
{
  try
  {
    var documentSource = await DocumentSource.CreateFromStorageFileUidAsync(uid);
    StorageFile file = documentSource.GetFile();
  }
  catch (Exception e)
  {
    // Examine the exception.
  }
}

Both the StorageProvider and DataProvider implementations can be used side by side. Because StorageProvider UIDs contain the file name while DataProvider UIDs are merely numeric, you can easily tell them apart when needed. Note that you need to maintain a list of all relevant providers yourself:

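// Storage file UIDs end with the PDF's file name; data provider UIDs don't.
DocumentSource document;
if (uid.EndsWith(".pdf"))
{
    document = await DocumentSource.CreateFromStorageFileUidAsync(uid);
}
else
{
    // Look up the provider in the list maintained by your application.
    document = DocumentSource.CreateFromDataProvider(_providers.Find(provider => provider.Uid == uid));
}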

Index and document status

Library allows you to query for the current indexing state.

To query the library only once all queued documents have been indexed, check IsIndexingAsync() first. You can also check the current status of individual documents by using GetIndexDocumentStatusAsync().
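
The IsIndexingAsync() check can be sketched as follows. This is a minimal sketch, not the SDK's prescribed pattern: it assumes IsIndexingAsync() yields a bool that is true while documents are still queued, and that LibraryQuery lives in the PSPDFKit.Search namespace alongside Library. The class and method names (LibrarySearchHelpers, SearchWhenIdleAsync) are hypothetical.

```csharp
using System;
using System.Threading.Tasks;
using PSPDFKit.Search;

internal static class LibrarySearchHelpers
{
    // Hypothetical helper, not part of the SDK.
    // Returns false without searching if the library is still indexing.
    internal static async Task<bool> SearchWhenIdleAsync(Library library, string text)
    {
        // Assumption: IsIndexingAsync() returns true while indexing is in progress.
        if (await library.IsIndexingAsync())
        {
            return false;
        }

        // Results are delivered through the library's search event handlers.
        await library.SearchAsync(new LibraryQuery(text));
        return true;
    }
}
```

If the helper returns false, you could retry later or call WaitForAllIndexingTasksToFinishAsync() before searching again.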

Querying the library

To query the library, use the SearchAsync method, supplying it with a LibraryQuery object.

Here’s an example:

// Search all documents in the library for the text "Acme."
var succeeded = await library.SearchAsync(new LibraryQuery("Acme"));

The results of the query are sent to a query result handler, which you must provide to the library.

Here’s an example:

library.OnSearchComplete += MyOnSearchCompleteMethod;

The OnSearchComplete event handler receives a reference to the originating library, along with a dictionary mapping a document UID to a LibraryQueryResult object. Each result object also contains the UID as a property and a list of the page indexes where matching results were found.
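
A handler along these lines might look like the sketch below. The exact parameter types are an assumption based on the description above (a Library plus an IDictionary keyed by UID), and the class name SearchResultLogger is hypothetical. The sketch only relies on the dictionary keys, so it doesn't depend on any particular LibraryQueryResult property names.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using PSPDFKit.Search;

internal sealed class SearchResultLogger
{
    // Assumed signature: the originating library and a UID-to-result dictionary.
    internal void MyOnSearchCompleteMethod(
        Library sender,
        IDictionary<string, LibraryQueryResult> results)
    {
        Debug.WriteLine($"{results.Count} document(s) matched.");

        foreach (var entry in results)
        {
            // The dictionary key is the UID of a document containing a match.
            Debug.WriteLine($"Match in document: {entry.Key}");
        }
    }
}
```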

If you wish to show preview snippets, you should set the GenerateTextPreviews property in the query object to true. Then, preview text snippets will be delivered to you via the OnSearchPreviewComplete event handler.

Here’s an example:

library.OnSearchPreviewComplete += MyOnSearchPreviewCompleteMethod;

var query = new LibraryQuery("Acme")
{
  GenerateTextPreviews = true
};
var succeeded = await library.SearchAsync(query);

The OnSearchPreviewComplete event handler receives a reference to the originating library, along with a list of LibraryPreviewResult objects — one for each match. Each of these objects contains a UID identifying the document, a page index where the matching text is located, a snippet of text surrounding the match, the range of the matched text within the preview snippet, and the page text. Each object also has an annotation ID indicating whether or not the match was found in an annotation.
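
As a rough sketch, a preview handler could log each match as shown below. The parameter types and the property names Uid, PageIndex, and PreviewText are assumptions inferred from the description above, not confirmed API names, so check them against the SDK's API reference. The class name PreviewLogger is hypothetical.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using PSPDFKit.Search;

internal sealed class PreviewLogger
{
    // Assumed signature: the originating library and one preview result per match.
    internal void MyOnSearchPreviewCompleteMethod(
        Library sender,
        IList<LibraryPreviewResult> previews)
    {
        foreach (var preview in previews)
        {
            // Assumed property names: Uid, PageIndex, PreviewText.
            Debug.WriteLine(
                $"{preview.Uid}, page index {preview.PageIndex}: {preview.PreviewText}");
        }
    }
}
```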

Advanced matching options

Library offers advanced matching options. You can set these options in a LibraryQuery object.

Password-protected documents

When indexing documents, it’s possible you might come across a password-protected document.

You can unlock a password-protected document with an event handler, which is fired every time a password is required. The following example shows how this is possible:

private Library _library;

internal async void Initialize(PdfView pdfView)
{
    _library = await Library.OpenLibraryAsync("catalog");
    _library.OnPasswordRequested += Library_OnPasswordRequested;
}

private void Library_OnPasswordRequested(Library sender, PasswordRequest passwordRequest)
{
    if (passwordRequest.Uid.Contains("Password.pdf"))
    {
        passwordRequest.Password = "test123";
    }

    passwordRequest.Deferral.Complete();
}

PasswordRequest will always have its UID populated with the path being indexed (note that the full path consists of the future access token and the file name). Check against this string to determine which document requires a password, and populate the Password member of PasswordRequest to unlock the document. Ensure the Deferral is completed, as in the last line of Library_OnPasswordRequested; otherwise, indexing will fail and throw an exception.

Example code

You’ll find a complete working code example in the Catalog app provided with the SDK.