Data-Driven Redaction in Web
There are many situations in which you might need to remove sensitive information from documents before distributing them. One example, which we’ll walk through in this post, is a credit card company that wants to run an internal audit of card application form submissions. The auditors likely only need the credit limit each customer is applying for and the customer’s address. To keep the process secure, all other personal information (name, phone number, email address, etc.) should be removed from the forms before they’re handed over to the auditing team. This process of removing content from a page is known as redaction.
Redacting a PDF
Let’s assume that all the credit card forms are in the PDF format. To remove sensitive information from a PDF, it isn’t enough to obscure the parts we want to redact; we also have to remove the corresponding data, and all references to it, from the PDF’s internal structure.
This redaction process involves two steps:
- Marking the areas on the document that we want to remove by adding redaction annotations.
- Removing the contents of those areas. This step is irreversible.
Redaction on Web
We added support for redaction in our web SDK starting with v2020.3.0. There are different ways we can apply redactions, which you can read about in our guides. For the purpose of this post, we’ll use APIs to programmatically remove sensitive information from a document.
Reading Data from a File
In scenarios like the one above, it’s not practical to redact each PDF manually. So instead, we’ll have to batch process all the files. We can keep the mapping of all the sensitive information we want to remove in a CSV or JSON file and use the files as the data source while we do the batch processing.
If you’re using a CSV file, you can use any one of the various CSV parsers published on npm to convert the files to JSON, since it’s a format that’s friendlier for web. Once the data is converted to JSON, it can be imported directly into your code.
For this example, we’ll use the following fake CSV:
Joanna R, +1-541-754-3010, jo.simonitti89@gmail.com
The PDF used for this example can be found as part of our Redaction Catalog example.
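As a rough illustration of the CSV-to-JSON step, here is a minimal sketch that turns a simple CSV string into a flat array of values to redact. The function name `parseSensitiveValues` and the `joanna@example.com` address are placeholders introduced for this sketch, and the code assumes one record per line with no quoted fields or embedded commas. For anything more complex, a dedicated CSV parser is the safer choice.

```javascript
// Minimal CSV-to-array sketch. Assumes one record per line, comma-separated,
// with no quoted fields or embedded commas (use a real CSV parser otherwise).
function parseSensitiveValues(csvText) {
  return csvText
    .split(/\r?\n/)                      // one record per line
    .flatMap(line => line.split(","))    // every field is a value to redact
    .map(field => field.trim())          // drop the padding around each field
    .filter(field => field.length > 0);  // skip blank fields and empty lines
}

// Placeholder data in the same shape as the sample row above.
const csv = "Joanna R, +1-541-754-3010, joanna@example.com";
const parsedValues = parseSensitiveValues(csv);
console.log(JSON.stringify(parsedValues));
```

The resulting array has the same shape as the list of strings passed to the redaction APIs in the next section.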
Redacting the Data
Once we have the list of data we want to remove from a document in the JSON format, we can use the Instance#createRedactionsBySearch API to mark the areas we want to redact:
const stringsToBeRedacted = [
  "Joanna R",
  "+1-541-754-3010",
  "jo.simonitti89@gmail.com"
];

const instance = await PSPDFKit.load(options);

// Create redaction annotations for every match of each string.
const redactionPromises = stringsToBeRedacted.map(str =>
  instance.createRedactionsBySearch(str)
);

await Promise.all(redactionPromises);
// Now all the redaction annotations have been added.
The above code adds redaction annotations to the areas that need to be redacted. Now we have to use Instance#applyRedactions to permanently remove the marked content:
await instance.applyRedactions();
This will remove all the occurrences of Joanna R, +1-541-754-3010, and jo.simonitti89[at]gmail.com, along with any data that was imported from the JSON or CSV.
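For batch processing, the two steps can be wrapped in one helper. The following is only a sketch: `redactValues` is a hypothetical name, `instance` is assumed to be the object resolved by `PSPDFKit.load(options)`, and the only SDK calls used are the two shown above. Filtering out blank and duplicate values before searching is an addition of this sketch, not something the SDK requires.

```javascript
// Sketch: mark every value, then apply the redactions in one step.
// `instance` is assumed to be the object resolved by PSPDFKit.load(options).
async function redactValues(instance, values) {
  // Drop blanks and duplicates so each value is searched only once.
  const unique = [...new Set(values.map(v => v.trim()).filter(Boolean))];

  // Step 1: add redaction annotations for every match of every value.
  await Promise.all(unique.map(v => instance.createRedactionsBySearch(v)));

  // Step 2: irreversibly remove the marked content.
  await instance.applyRedactions();

  return unique; // the values that were actually searched for
}
```

Calling `await redactValues(instance, stringsToBeRedacted)` once per loaded document would then cover both the marking and the removal.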
Verifying the Data Removal
Since the redaction process is irreversible, there might be cases where you want someone to verify the data to be removed before it is redacted. In such a case, you can add the redaction annotations on the document and have someone confirm they’re correct before applying the redactions. Keep in mind that adding redaction annotations doesn’t remove the actual data; it just marks the positions on the page that will be redacted once you call Instance#applyRedactions. You can delete or modify redaction annotations before applying the permanent redactions.
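One way to build this review step into code is to gate the irreversible call behind an explicit approval. The sketch below assumes the redaction annotations have already been added; `applyRedactionsAfterReview` and the `confirm` callback are hypothetical names for illustration, with `confirm` standing in for whatever review step you use (a dialog, an approval queue, and so on).

```javascript
// Sketch: apply already-marked redactions only after an explicit approval.
// `confirm` is a hypothetical callback (e.g. a review dialog) that resolves
// to true when the reviewer approves the marked areas.
async function applyRedactionsAfterReview(instance, confirm) {
  const approved = await confirm();
  if (!approved) {
    // Nothing is removed; the annotations can still be edited or deleted.
    return false;
  }
  await instance.applyRedactions(); // irreversible from here on
  return true;
}
```

If the reviewer rejects the marked areas, the document keeps its original content and the redaction annotations remain available for adjustment.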
Conclusion
In this blog post, we went over how to import data from an external source and use it to remove data from a PDF document. This is especially helpful when automating the redaction process. If you want to learn more about the different ways you can create and apply redactions with PSPDFKit for Web, head over to our Levels of Redaction Automation blog post.
If you’re interested in checking out the Redaction component, head over to our Redaction Catalog example. Redaction is a feature that has to be purchased separately. If it’s not part of your license, the APIs and the UI changes mentioned above won’t function as described. Please ping our sales team if you’re interested in integrating this feature into your application.