Is My Document a Valid PDF?

This article was first published in July 2019 and was updated in August 2024.

PDF documents are widely used because they faithfully represent and preserve information. Before you can process a file reliably, though, you need to know whether it’s actually a valid PDF. In this blog post, we’ll cover the basics of identifying an invalid PDF and see how PSPDFKit handles such cases.

What Is a PDF?

From a technical perspective, a PDF is a file format with a special syntax that must be adhered to. Conceptually, it represents data whose integrity we want to preserve across different systems. This distinction matters when checking validity: a file might have valid PDF syntax but still be considered invalid if it has other issues, such as containing no pages.

How Can a PDF Become Invalid?

A PDF can be deemed invalid for several reasons, such as:

  • No pages — A PDF without page information is invalid.

  • Encryption — Encrypted PDFs are considered invalid until decrypted.

  • Missing header — A valid PDF must include a header defining the specification version within the first 1,024 bytes. Missing this header renders the PDF invalid.
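The header rule in the list above can be sketched in a few lines of Swift. This is a minimal illustration, not PSPDFKit’s implementation: `pdfHeaderVersion(in:)` is a hypothetical helper that searches the first 1,024 bytes for `%PDF-` and returns the version string that follows it.

```swift
import Foundation

/// Hypothetical helper (not part of any SDK): returns the version from a
/// `%PDF-x.y` header found in the first 1,024 bytes, or nil if there is none.
func pdfHeaderVersion(in data: Data) -> String? {
    let marker = Data("%PDF-".utf8)
    guard data.count >= marker.count else { return nil }

    // Only the first 1,024 bytes are searched for the header marker.
    let windowEnd = min(data.endIndex, data.startIndex + 1024)
    let window = data.startIndex..<windowEnd
    guard let markerRange = data.range(of: marker, options: [], in: window) else {
        return nil
    }

    // Collect the digits and dots that follow the marker, e.g. "1.7".
    let versionBytes = data[markerRange.upperBound..<windowEnd].prefix(while: { byte in
        byte == UInt8(ascii: ".") || (UInt8(ascii: "0")...UInt8(ascii: "9")).contains(byte)
    })
    let version = String(decoding: versionBytes, as: UTF8.self)
    return version.isEmpty ? nil : version
}
```

A caller would treat a `nil` result as a missing header and stop before attempting any further parsing.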

The PDF specification doesn’t explicitly detail how to determine an invalid PDF format, which leaves software vendors to use their judgment. PSPDFKit, for example, deems a PDF invalid if it’s encrypted or if it fails certain internal checks.

The Scope section at the start of the official PDF specification lists validation among the things the standard leaves out:

This standard does not specify the following:

  • specific processes for converting paper or electronic documents to the PDF format;
  • specific technical design, user interface or implementation or operational details of rendering;
  • specific physical methods of storing these documents such as media and storage conditions;
  • methods for validating the conformance of PDF files or readers;
  • required computer hardware and/or operating system.

This leaves a gap for PDF software vendors, who have to use their best judgment to decide when a PDF file should be considered invalid. Within the context of PSPDFKit, we also deem a PDF invalid if it’s encrypted, because you effectively can’t interact with it until it’s unlocked.

Things can get even more complicated if we consider other file format standards related to PDF. One example is PDF/A, another ISO standard, which specializes in the archiving and long-term preservation of electronic documents.

PDF/A defines very specific requirements for how data must be laid out to accomplish its goal, which adds a whole new level of complexity to determining whether or not a PDF/A file is valid. As a result, there are specialized resources, such as the Isartor Test Suite and veraPDF, that provide tests and can serve as a starting point for creating validation software for this specific format.

Using PSPDFKit to Validate PDFs

At PSPDFKit, we take a rather pragmatic approach to checking if we can work with a file as a PDF or not. Internally, PSPDFKit performs a series of checks to determine if a PDF is valid:

  1. Is this even a PDF? — We look for the %PDF- directive in the file header. If this is missing, we abort any subsequent operations, as we can’t rely on the file to contain PDF syntax.

  2. Is the file large enough to be a valid PDF? — We check the total file size to see if it’s larger than the size of the header (%PDF) and the end-of-file marker (%%EOF) added together. If this test fails, the file is automatically deemed invalid.

  3. Do we have an end-of-file marker at all? — We’ll try to load the last 1,024 bytes of the file to look for an %%EOF marker. Not having an %%EOF marker makes the file an invalid PDF.

  4. Does the file contain more PDF syntax after %%EOF? — If this is the case, then we’re dealing with a malformed file, and trying to perform any other operations with it would be a waste of resources, so we say this case is also grounds to deem a PDF invalid.
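The four checks above can be sketched roughly as follows. This isn’t PSPDFKit’s actual code: `PDFCheckResult` and `checkPDFStructure(_:)` are hypothetical names. The sketch also makes two assumptions: the header is searched for in the first 1,024 bytes, and “more PDF syntax after %%EOF” is taken to mean any non-whitespace byte after the last marker.

```swift
import Foundation

/// Hypothetical result type; `plausible` only means the four checks passed.
enum PDFCheckResult: Equatable {
    case plausible, missingHeader, tooSmall, missingEOF, trailingData
}

/// Hypothetical helper that mirrors the four checks described above.
func checkPDFStructure(_ data: Data) -> PDFCheckResult {
    let header = Data("%PDF-".utf8)
    let eof = Data("%%EOF".utf8)
    guard data.count >= header.count else { return .missingHeader }
    let start = data.startIndex
    let end = data.endIndex

    // 1. Is this even a PDF? Look for the header marker near the start.
    let headEnd = min(end, start + 1024)
    guard data.range(of: header, options: [], in: start..<headEnd) != nil else {
        return .missingHeader
    }

    // 2. Is the file larger than the header and end-of-file marker combined?
    guard data.count > header.count + eof.count else { return .tooSmall }

    // 3. Is there an end-of-file marker in the last 1,024 bytes?
    let tailStart = max(start, end - 1024)
    guard let eofRange = data.range(of: eof, options: .backwards, in: tailStart..<end) else {
        return .missingEOF
    }

    // 4. Is there anything other than whitespace after the last marker?
    let whitespace: Set<UInt8> = [0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20]
    let trailing = data[eofRange.upperBound..<end]
    return trailing.allSatisfy({ whitespace.contains($0) }) ? .plausible : .trailingData
}
```

Returning an enum rather than a Boolean keeps the reason for a failure available to the caller, which is useful for logging or for showing a more specific error.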

From an end user perspective, it’s easy to see if a PDF is valid or not: if it is, you’ll see it displayed onscreen. If it’s not, you’ll see an error message instead.

If you’d like to do a manual check on a document before even attempting to present it, you can do so as follows:

// Swift
let url = // Document URL.
let document = PSPDFDocument(url: url)

// Check if the document is valid before continuing.
guard document.isValid else {
	// Perform appropriate cleanup actions.
	return
}

// Objective-C
NSURL *url = // Document URL.
PSPDFDocument *document = [[PSPDFDocument alloc] initWithURL:url];

// Check if the document is valid before continuing.
if (!document.isValid) {
	// Perform appropriate cleanup actions.
	return;
}

Calling PSPDFDocument.isValid will lazily load the document. If the document is valid and we were able to parse it correctly, then the document’s pages will be available to us.

Conclusion

As we saw in this post, there are multiple aspects to consider when determining whether or not a PDF is valid. Given the broad field of applications the PDF format has, it can be very difficult to come to an agreement on what exactly constitutes a “valid” PDF.

At PSPDFKit, we interpret the PDF specification as closely as we can to make sure we can deliver the reliability our customers expect from us. Nevertheless, as with many aspects of dealing with PDF technologies, this is an ongoing effort, and we’ll always be looking to improve the ways in which we can provide the best experience possible.

FAQ

Here are a few frequently asked questions about determining if a PDF is valid.

What are the most common signs of an invalid PDF format?

Common signs of an invalid PDF format include missing pages, corrupted content, an unreadable file header, or an absence of the end-of-file marker (%%EOF). If the PDF fails to open or display correctly, it may be due to these issues.

How can I check if a PDF is encrypted?

An encrypted PDF typically requires a password before its content can be accessed, and most PDF readers will prompt for that password when you open the file. In PSPDFKit, encrypted PDFs are treated as invalid until they’re decrypted.
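If you only have the raw bytes, a rough heuristic is to look for an `/Encrypt` entry, which the PDF specification places in the trailer dictionary of encrypted documents. The sketch below is an assumption-laden illustration rather than a real parser: `looksEncrypted(_:)` is a hypothetical helper, it scans the whole file instead of locating the trailer, and it can be fooled by the same byte sequence appearing elsewhere in uncompressed content.

```swift
import Foundation

/// Hypothetical heuristic: true if the bytes contain the complete name `/Encrypt`.
func looksEncrypted(_ data: Data) -> Bool {
    let key = Data("/Encrypt".utf8)
    var searchStart = data.startIndex

    while searchStart < data.endIndex,
          let match = data.range(of: key, options: [], in: searchStart..<data.endIndex) {
        // A key with nothing after it has no value, so it can't be a real entry.
        guard match.upperBound < data.endIndex else { return false }

        // The name must end here: a following letter means a longer name,
        // such as `/EncryptMetadata`, so keep searching after this match.
        let next = data[match.upperBound]
        let isLetter = (UInt8(ascii: "A")...UInt8(ascii: "Z")).contains(next)
            || (UInt8(ascii: "a")...UInt8(ascii: "z")).contains(next)
        if !isLetter { return true }
        searchStart = match.upperBound
    }
    return false
}
```

A positive result is only a hint that the file should be opened with a library that can prompt for, or be given, a password.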

What tools can I use to validate a PDF file?

Tools like PSPDFKit can validate PDF files by checking the file header, end-of-file marker, and overall file size. For specific formats like PDF/A, specialized tools such as the Isartor Test Suite and veraPDF are available for more detailed validation.

Can an invalid PDF format be repaired?

Repairing an invalid PDF can be challenging, depending on the issue. Some tools may attempt to fix structural problems, but if the file is severely corrupted or malformed, it may not be recoverable.

How can I prevent my PDFs from becoming invalid?

To prevent PDFs from becoming invalid, ensure that you use reliable software for creating and handling PDFs, adhere to proper PDF standards, and verify the integrity of your files before distribution.
