How to redact personal information from a PDF in .NET

With the world moving toward a paperless society, more and more personal information is being stored in digital formats. Digital formats are great for searchability, storage, and simple distribution, but there are also drawbacks to having all this information in one place. For example, what if you have personal information you’d like to remove before sending a document to a client, company, or retailer? Well, that’s where redaction comes in. Today, we’re going to see how to permanently redact sensitive information from PDFs using .NET.

What is redaction?

Redaction is the process of removing content from a document, obscuring the content from view, and getting rid of any digital references to the data.

When working with Nutrient, redaction is accomplished in the following steps:

  1. Mark areas for redaction — Add redaction annotations as described in the PDF specification. Visually, the staged redactions will be marked on the document with a box outlined in red.

  2. Remove the content — In this step, the page content within the region of the redaction annotations is irreversibly removed.

How can we redact in .NET?

With the release of Nutrient Libraries 1.1, we added a new RedactionProcessor API in both .NET and Java. This API handles a variety of redaction use cases, from removing simple strings to removing complex data with preset search algorithms.

Redacting common data

In our first use case, we want to remove some personal information. To do this, Nutrient .NET SDK offers handy presets for identifying common data like email addresses, phone numbers, credit card numbers, dates, and zip codes. You don’t have to know any technical details about the data formats; you just call the preset:

RedactionProcessor.Create()
    .AddRedactionTemplates(new[] {new RedactionPreset {Preset = RedactionPreset.Type.EmailAddress}})
    .Redact(document);

The code above will identify all email addresses in the document in question and irreversibly remove them.

Redacting a specific string

But what if we want to search for a specific name?

RedactionProcessor.Create()
    .AddRedactionTemplates(new[] {new RedactionRegEx {Pattern = "John Smith"}})
    .Redact(document);

In the example above, the API will search for the pattern John Smith and identify all instances of it for removal. Note that this string is a regular expression pattern, so any regular expression metacharacters in a literal search string (a period, for example) must be escaped. It also means we can make the search smarter by accounting for variations of the name John Smith:

(?:(?:[Jj]ohn) (?:[Ss]mith))|(?:[Jj]ohn)|(?:[Ss]mith)

This regular expression matches John alone, Smith alone, or John Smith, and it accepts either an uppercase or a lowercase first letter in each name.

Note that the regular expression engine can differ between operating systems. However, every platform uses an ICU-based implementation, and ICU regular expressions are modeled on those in Perl.
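Before handing a pattern to the redaction API, it can help to check what it selects in plain text. The sketch below is one way to do that, and it rests on a few assumptions: it uses .NET's built-in System.Text.RegularExpressions engine as a stand-in for the SDK's ICU-based engine (the two agree on the simple syntax used here, but are not identical in general), and the NamePatternDemo class, its FindMatches helper, and the WithBoundaries variant are hypothetical names introduced for illustration, not part of Nutrient .NET SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamePatternDemo
{
    // The pattern from the post, unchanged.
    public const string Original = "(?:(?:[Jj]ohn) (?:[Ss]mith))|(?:[Jj]ohn)|(?:[Ss]mith)";

    // Hypothetical stricter variant (an assumption, not from the post):
    // \b word boundaries prevent matches inside longer words such as "Johnson".
    public const string WithBoundaries = @"\b(?:[Jj]ohn [Ss]mith|[Jj]ohn|[Ss]mith)\b";

    // Returns every matched substring, in order of appearance.
    public static List<string> FindMatches(string pattern, string text)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(text, pattern))
        {
            results.Add(match.Value);
        }
        return results;
    }

    public static void Main()
    {
        const string sample = "john Smith emailed Johnson about Smith.";

        // Prints: john Smith | John | Smith
        Console.WriteLine(string.Join(" | ", FindMatches(Original, sample)));

        // Prints: john Smith | Smith
        Console.WriteLine(string.Join(" | ", FindMatches(WithBoundaries, sample)));
    }
}
```

With the original pattern, the John inside Johnson is also selected, which may or may not be what you want. If it isn't, a word-boundary variant like the one above is one option to consider, and it is worth confirming the result against the SDK itself because this check runs on a different regular expression engine.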

Only mark areas for redaction

A very common use case is to review the marked areas before committing redactions. This could be done programmatically or via human interaction.

Stopping the RedactionProcessor API before the redaction stage is simple. All you need to do is call IdentifyAndAddRedactionAnnotations rather than Redact:

RedactionProcessor.Create()
    .AddRedactionTemplates(new[] {new RedactionPreset {Preset = RedactionPreset.Type.SocialSecurityNumber}})
    .IdentifyAndAddRedactionAnnotations(document);

You can now perform any extra operations on the document before saving it, either with the redactions still only marked or with them applied:

// Save and only mark annotations for redaction.
document.Save(new DocumentSaveOptions());

Or:

// Save and redact the marked content.
document.Save(new DocumentSaveOptions
{
    applyRedactionAnnotations = true
});

Redact multiple pieces of data

If you’re redacting one piece of information, you likely want to redact several. Bringing together everything covered above, we can build a “redaction shape” to remove all pieces of sensitive information:

using System.Collections.Generic; // Needed for List<T>.

var isInternationalDocument = false; // Drives logic behind redacting international phone numbers.
var redactionShape = new List<RedactionTemplate>
{
    new RedactionPreset {Preset = RedactionPreset.Type.EmailAddress},
    new RedactionPreset {Preset = RedactionPreset.Type.UsZipCode},
    new RedactionPreset {Preset = RedactionPreset.Type.NorthAmericanPhoneNumber},
    new RedactionPreset {Preset = RedactionPreset.Type.SocialSecurityNumber},
    new RedactionRegEx {Pattern = "(?:(?:[Jj]ohn) (?:[Ss]mith))|(?:[Jj]ohn)|(?:[Ss]mith)"}
};

var processor = RedactionProcessor.Create().AddRedactionTemplates(redactionShape);

if (isInternationalDocument)
{
    processor.AddRedactionTemplates(new[]
        {new RedactionPreset {Preset = RedactionPreset.Type.InternationalPhoneNumber}});
}

processor.Redact(document);

In the example above, we build up the redaction shape by including searches for multiple pieces of sensitive information, and we even add an extra redaction search based on whether or not we’re looking at an international document. When Redact is called, multiple searches will take place and result in the redaction of all types of sensitive information.

Document with multiple items of data redacted using redaction shape

If there are any custom types of information in the document, it’s possible to build your own search term with the RedactionRegEx template. This information could include account numbers, addresses, company names, etc.
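As a rough illustration of a custom search term, the sketch below previews what a pattern would select before it is used for redaction. Everything specific in it is an assumption: the account number format ("ACC-" followed by eight digits) is invented for the example, the CustomPatternCheck class and its Preview helper are hypothetical, and the preview runs on .NET's System.Text.RegularExpressions engine, not the SDK's ICU-based engine. It only masks characters in a string and does not redact a PDF.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CustomPatternCheck
{
    // Hypothetical account number format, assumed for illustration:
    // "ACC-" followed by exactly eight digits.
    public const string AccountNumberPattern = @"\bACC-[0-9]{8}\b";

    // Replaces each match with X characters of equal length, purely to
    // visualize what the pattern selects. This is not PDF redaction.
    public static string Preview(string text, string pattern)
    {
        return Regex.Replace(text, pattern, m => new string('X', m.Length));
    }

    public static void Main()
    {
        // Prints: Refund XXXXXXXXXXXX to the customer.
        Console.WriteLine(Preview("Refund ACC-12345678 to the customer.", AccountNumberPattern));

        // Once the pattern selects what you expect, it could be passed to the
        // SDK in the same way as the earlier examples in this post:
        //   new RedactionRegEx {Pattern = CustomPatternCheck.AccountNumberPattern}
    }
}
```

Checking a pattern on sample text first is cheap, and it matters here because applied redactions are irreversible. For real documents, marking with IdentifyAndAddRedactionAnnotations and reviewing the result remains the safer check.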

Conclusion

In this blog post, we outlined the new RedactionProcessor API released in Nutrient .NET SDK 1.1. We redacted some simple information and then built up a larger template of redactions to search for. With this new API, it’s possible to adapt the code further and redact batches of documents or construct a redaction template from an external data source, like a database. The API is easy to use yet flexible enough to suit many different use cases. Why not try it today?

FAQ

Here are a few frequently asked questions about redacting in .NET.

What is the purpose of redaction in PDFs?
Redaction is used to permanently remove sensitive or confidential information from a PDF to protect privacy.

How does the RedactionProcessor API in .NET work?
The RedactionProcessor API in .NET marks areas for redaction and then permanently removes the content within those areas.

Can I specify multiple types of data to redact at once?
Yes, you can specify multiple data types (such as emails, phone numbers, and social security numbers) using the RedactionProcessor API.

Is it possible to review redactions before applying them?
Yes, you can mark areas for redaction first, review them, and then choose to apply or discard the redactions.

Can I use custom patterns for redaction?
Yes, with regular expressions, you can create custom patterns to redact specific data not covered by preset options.
Author

Nick Winder, Core Engineer

When Nick started tinkering with guitar effects pedals, he didn’t realize it’d take him all the way to a career in software. He has worked on products that communicate with space, blast Metallica to packed stadiums, and enable millions to use documents through Nutrient, but in his personal life, he enjoys the simplicity of running in the mountains.
