PDF documents can contain all kinds of sensitive information, and in certain scenarios, it’s necessary to remove this information. So in this blog post, I’ll explain how to leverage the redaction functionality of PSPDFKit for Android to effectively remove text patterns from documents — all without the documents leaving the user’s device.
What Is Redaction?
Before showing how to redact documents programmatically, I’ll first cover what redaction is. PDF documents can contain sensitive information such as addresses, credit card and social security numbers, phone numbers, emails, etc. The process of irreversibly removing information from a PDF is called redaction.
Adding a black rectangle on top of a document’s content in a PDF editor is not enough to guarantee the privacy of sensitive data. This is because PDF pages contain multiple layers of data, so covering content on a page doesn’t mean the original data can’t be extracted from the PDF file itself.
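To make the difference between covering and removing concrete, here’s a minimal sketch in plain Kotlin. It uses a toy page model (the `Page`, `TextRun`, and `Overlay` types are hypothetical and have nothing to do with the real PDF format or with PSPDFKit’s API) in which text extraction only reads the text layer:

```kotlin
// Toy model of a page: NOT the real PDF format, just enough to show the idea.
data class TextRun(val text: String)
data class Overlay(val description: String)
data class Page(val textRuns: List<TextRun>, val overlays: List<Overlay> = emptyList())

// Text extraction only looks at the text layer; shapes drawn on top are ignored.
fun extractText(page: Page): String = page.textRuns.joinToString(" ") { it.text }

// "Covering" adds a shape above the text but leaves the text layer untouched.
fun coverWithRectangle(page: Page): Page =
    page.copy(overlays = page.overlays + Overlay("black rectangle"))

// "Redacting" removes the matching text from the page data itself.
fun redact(page: Page, secret: String): Page =
    page.copy(textRuns = page.textRuns.filterNot { it.text == secret })

fun main() {
    val page = Page(listOf(TextRun("SSN:"), TextRun("123-45-6789")))
    println(extractText(coverWithRectangle(page))) // SSN: 123-45-6789
    println(extractText(redact(page, "123-45-6789"))) // SSN:
}
```

In this model, the covered page still yields the number when its text is extracted, while the redacted page doesn’t contain it at all.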
You can read a more in-depth introduction to redaction on our blog.
Redaction using PSPDFKit is a two-step process:
- First, create redaction annotations to mark areas that should be redacted. This doesn’t remove information from the document, which allows multiple collaborators to work on preparing the document for redaction.
- Then, apply the redaction annotations, which rewrites the document data without the redacted content.
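The two steps can be sketched with a toy model in plain Kotlin. The `Draft` and `Mark` types below are hypothetical stand-ins, with a string playing the role of the document and character ranges playing the role of redaction annotations. Unlike real redaction, this sketch replaces characters instead of removing the underlying data:

```kotlin
// Toy model: a "document" is a string and a mark is a character range.
data class Mark(val range: IntRange)

class Draft(val text: String) {
    val marks = mutableListOf<Mark>()

    // Step 1: mark every occurrence of a term. The text itself stays unchanged.
    fun mark(term: String) {
        require(term.isNotEmpty()) { "term must not be empty" }
        var index = text.indexOf(term)
        while (index >= 0) {
            marks += Mark(index until index + term.length)
            index = text.indexOf(term, index + term.length)
        }
    }

    // Step 2: produce a new string in which all marked characters are replaced.
    fun applyMarks(): String {
        val chars = text.toCharArray()
        for (mark in marks) for (i in mark.range) chars[i] = 'X'
        return chars.concatToString()
    }
}

fun main() {
    val draft = Draft("Call Ann at 555-0100 or 555-0100.")
    draft.mark("555-0100")
    println(draft.text)         // Still the original text.
    println(draft.applyMarks()) // Call Ann at XXXXXXXX or XXXXXXXX.
}
```

Keeping the marks separate from the text is what makes the first step reversible: marks can be reviewed, added, or dropped before anything is rewritten.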
PSPDFKit ships with a full UI for creating and applying redactions.
This post will focus on redacting documents programmatically. My goal is to explain how to redact a document by searching for text and patterns that should be redacted.
Creating Redactions from Search
To start, we’ll open a PDF document:
val document = PdfDocumentLoader.openDocument(context, documentUri)
PSPDFKit’s search API provides programmatic text search inside page text and annotations. Let’s use it to search for specific text:
// First, create a `TextSearch` instance.
val textSearch = TextSearch(document, PdfConfiguration.Builder().build())
// Then perform a search for the query 'pspdfkit'.
val searchResults = textSearch.performSearch("pspdfkit")
We now have a list of search results. Every search result consists of a page index and a text block containing the PDF coordinates of the found text on that page. Redaction annotations can be created directly from these PDF coordinates:
val redactionAnnotations = searchResults.map { searchResult ->
    // Create redaction annotations covering the page rectangles of the search result.
    RedactionAnnotation(searchResult.pageIndex, searchResult.textBlock.pageRects)
}
Finally, we’ll add these annotations to the document. This will mark all instances of the searched text for redaction:
redactionAnnotations.forEach { redaction ->
    document.annotationProvider.addAnnotationToPage(redaction)
}
We’re now done with the first part of the redaction process, so we’ll continue with applying the redactions to the document.
Applying Redactions
There are two separate ways to apply redactions programmatically.
We can apply redactions when saving a document. All we need to do is use the DocumentSaveOptions#setApplyRedactions() save option:
val documentSaveOptions = document.defaultDocumentSaveOptions
documentSaveOptions.setApplyRedactions(true)
document.save(outputFilePath, documentSaveOptions)
// Alternatively, you can rewrite the original document file via
// `document.save(documentSaveOptions)`.
It’s also possible to use the PDF processor to apply redactions. This approach is more flexible than saving a document, as it allows us to use the entire range of PDF processing operations provided by the PDF processor. To apply the redactions, use the PdfProcessorTask created from the source document with the applyRedactions() operation:
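// Destination for the redacted copy. This location is only an example; any writable file works.
val outputFile = File(context.filesDir, "redacted.pdf")
val processorTask = PdfProcessorTask.fromDocument(document)
    .applyRedactions()
PdfProcessor.processDocument(processorTask, outputFile)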
Once we’re done applying the redactions, we can preview the redacted document by opening it in PdfActivity:
val intent = PdfActivityIntentBuilder.fromUri(context, Uri.fromFile(outputFile))
.configuration(configuration)
.build()
startActivity(intent)
Searching for Patterns
I’ve already shown how to redact by searching for an exact snippet of text. But the text search API also supports searching for regular expression patterns, which allows us to provide a powerful redaction API.
Regular expression search is disabled by default. To enable it, use the REGULAR_EXPRESSION compare option inside the search options:
val searchOptions = SearchOptions.Builder().compareOptions(CompareOptions.REGULAR_EXPRESSION).build()
ℹ️ Note: For a comprehensive list of all search options, refer to our search guides.
Let’s see regex search in action. We’ll search for all email addresses in the document as an example. For this purpose, we’ll use the battle-tested regex from emailregex.com:
val emailRegex = "(?:[a-z0-9!#\$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#\$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])"
val textSearch = TextSearch(document, PdfConfiguration.Builder().build())
val searchOptions = SearchOptions.Builder().compareOptions(CompareOptions.REGULAR_EXPRESSION).build()
val searchResults = textSearch.performSearch(emailRegex, searchOptions)
ℹ️ Note: TextSearch uses regular expression capabilities available in Java. Refer to the Pattern documentation for the supported regular expression syntax.
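Because the syntax is the one from Java’s Pattern class, one way to sanity-check a pattern before running it against a document is to try it on sample strings with `java.util.regex.Pattern` directly. The sketch below does that in plain Kotlin. The two patterns are simplified illustrations (not the email regex used above, and not exhaustive), and `findAll` and `samplePatterns` are hypothetical names. It also assumes the document search matches the same way `Matcher.find()` does, which other search options may change:

```kotlin
import java.util.regex.Pattern

// Simplified, illustrative patterns; adjust them to the data you need to redact.
val samplePatterns = mapOf(
    "email" to "[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}",
    "us-ssn" to "\\b\\d{3}-\\d{2}-\\d{4}\\b"
)

// Returns every substring of `text` matched by `regex`.
fun findAll(regex: String, text: String): List<String> {
    val matcher = Pattern.compile(regex).matcher(text)
    val hits = mutableListOf<String>()
    while (matcher.find()) hits += matcher.group()
    return hits
}

fun main() {
    val text = "Contact jane.doe@example.com, SSN 123-45-6789."
    for ((name, regex) in samplePatterns) {
        println("$name -> ${findAll(regex, text)}")
    }
}
```

Checking patterns this way is cheap, and it catches escaping mistakes (such as a missing double backslash in a Kotlin string) before any annotations are created.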
Asynchronous Methods
Text search and document processing can take a while, and running them on the main thread can trigger an Application Not Responding (ANR) error in your app. As such, the previous code examples assume you’re executing them on a background thread.
PSPDFKit ships with RxJava-based asynchronous versions for most of its APIs. Here’s the asynchronous version of the code above using this API:
val textSearch = TextSearch(document, PdfConfiguration.Builder().build())
textSearch.performSearchAsync(searchTerm, searchOptions)
    // Execute the text search on the background thread.
    .subscribeOn(Schedulers.computation())
    // Create redaction annotations covering the page rectangles of the search result.
    .map { searchResult -> RedactionAnnotation(searchResult.pageIndex, searchResult.textBlock.pageRects) }
    // Add the redaction annotations to the document.
    .flatMapCompletable { redaction -> document.annotationProvider.addAnnotationToPageAsync(redaction) }
    // And process the document.
    .andThen(Flowable.defer<ProcessorProgress> {
        val task = PdfProcessorTask.fromDocument(document).applyRedactions()
        return@defer PdfProcessor.processDocumentAsync(task, outputFile)
            // Drop update events to avoid backpressure issues on slow devices.
            .onBackpressureDrop()
            .subscribeOn(Schedulers.io())
    })
    // Observe results on the main thread to allow interacting with the UI.
    .observeOn(AndroidSchedulers.mainThread())
    .doOnComplete {
        // Show the redacted document when the processing completes.
    }
    .subscribe({ processorProgress ->
        // Update the UI with the processing progress.
    })
If you’re not comfortable with a reactive way of thinking, you can still use the synchronous API from a background thread with the standard threading primitives. For more information, refer to the Android documentation’s introduction to background threading.
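As a rough sketch of the non-reactive route, the snippet below runs blocking work on a single background thread using `java.util.concurrent` executors. It’s plain Kotlin with no PSPDFKit calls: `runInBackground` and `redactionExecutor` are hypothetical names, and the `redactBlocking` lambda is a stand-in for the synchronous search-and-redact code shown earlier.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Single background thread for blocking work such as search and processing.
val redactionExecutor = Executors.newSingleThreadExecutor()

// Runs `blockingWork` off the calling thread and hands the outcome to `onDone`.
// Note: `onDone` is invoked on the background thread. On Android, post back to
// the main thread before touching the UI.
fun <T> runInBackground(blockingWork: () -> T, onDone: (Result<T>) -> Unit) {
    redactionExecutor.execute {
        onDone(runCatching(blockingWork))
    }
}

fun main() {
    // Hypothetical stand-in for the synchronous search-and-redact code.
    val redactBlocking = { Thread.sleep(50); "redacted.pdf" }

    runInBackground(redactBlocking) { result ->
        println("Finished on ${Thread.currentThread().name}: ${result.getOrNull()}")
    }

    // Let the queued task finish, then release the thread.
    redactionExecutor.shutdown()
    redactionExecutor.awaitTermination(1, TimeUnit.SECONDS)
}
```

Wrapping the outcome in `Result` keeps failures (for example, an I/O error while writing the output file) from being silently lost on the background thread.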
Conclusion
In this blog post, I showed how to create batches of redactions in your Android apps using PSPDFKit for Android. This is only a small example of PSPDFKit’s programmatic API, which provides powerful building blocks that make it possible to achieve all the PDF processing needed by your business use case.
But PSPDFKit for Android is only a small part of it. PSPDFKit supports redactions across all platforms, including server-side products: PSPDFKit Server and the .NET and Java libraries can be used when you need server-side redaction.