Extracting Images from a PDF in Swift

Stefan Kieleithner

March 25, 2019

Sometimes when dealing with a PDF, you might find an image embedded on a page that you would like to have extracted as an image file for storing or editing. In this post, we will go over potential approaches to extracting an embedded image from a document page using Swift.

Note that extracting images is different than rendering a PDF page as an image. An embedded image is only a single part of a PDF page, which means it needs to be queried and extracted from the internal PDF structure.

Also be aware that not all content from a PDF that looks like an image can be extracted via the below methods. While many things may look like images, some contents might actually be represented as vector graphics in the PDF, and as such, can’t be extracted via the following approaches.

Using Core Graphics

The first method we’ll cover is how extracting embedded images can be done with the system-provided API, and we’ll use the PDF support of CoreGraphics for this. To use this API, it’s necessary to familiarize ourselves a bit with the PDF structure, since we will need to extract the images from the PDF file structure manually.

We use CGPDFDocument(opens in a new tab) to open the document and get the appropriate CGPDFPage(opens in a new tab) where the image we want to extract is placed. Then we get the Resources dictionary entry, which is represented by a CGPDFDictionary(opens in a new tab) object. Traversing the PDF tree further, we retrieve the XObject entry, which might contain multiple elements on a PDF page, including images. Then we try to get the image stream — a CGPDFStream(opens in a new tab) object — of the first image on the page, which might be represented by the Im0 entry. Finally, we extract the image data from the stream object, which we in turn convert into a UIImage object. Simplified, this can be expressed as the following:

CGPDFPage -> Resources -> XObject -> Im0

All of this can be done with logic like this:

func getEmbeddedImage(ofPDFAt url: URL, pageIndex: Int) -> UIImage? {
    guard let document = CGPDFDocument(url as CFURL) else {
        print("Couldn't open PDF.")
        return nil
    }
    // `page(at:)` uses pages numbered starting at 1.
    let page = pageIndex + 1
    guard let pdfPage = document.page(at: page), let dictionary = pdfPage.dictionary else {
        print("Couldn't open page.")
        return nil
    }
    var res: CGPDFDictionaryRef?
    guard CGPDFDictionaryGetDictionary(dictionary, "Resources", &res), let resources = res else {
        print("Couldn't get Resources.")
        return nil
    }
    var xObj: CGPDFDictionaryRef?
    guard CGPDFDictionaryGetDictionary(resources, "XObject", &xObj), let xObject = xObj else {
        print("Couldn't load page XObject.")
        return nil
    }
    var stream: CGPDFStreamRef?
    guard CGPDFDictionaryGetStream(xObject, "Im0", &stream), let imageStream = stream else {
        print("No image on PDF page.")
        return nil
    }
    var format: CGPDFDataFormat = .raw
    guard let data = CGPDFStreamCopyData(imageStream, &format) else {
        print("Couldn't convert image stream to data.")
        return nil
    }
    let image = UIImage(data: data as Data)
    return image
}

This approach usually succeeds and yields a result if the page has an embedded image. However, be aware that this might not always be the case, since the Im0 entry is not guaranteed to be used for the first image. As a result, this logic might not always return the correct image, or any image at all, as we only check for the image with the key Im0. In the case where the PDF page doesn’t have an image with the key Im0, but instead with some other key, this logic would not detect it.

To cover all possible cases, we would need to iterate over all the objects that are available in the XObject element, check if the Subtype entry is of type Image (since this indicates that the object is an image), and extract the image for those entries. So, to make sure to extract all available images of a PDF page, we could use this logic:

var imageKeys = [String]()
CGPDFDictionaryApplyBlock(xObject, { key, object, _ in
    var stream: CGPDFStreamRef?
    guard CGPDFObjectGetValue(object, .stream, &stream),
        let objectStream = stream,
        let streamDictionary = CGPDFStreamGetDictionary(objectStream) else { return true }
    var subtype: UnsafePointer<Int8>?
    guard CGPDFDictionaryGetName(streamDictionary, "Subtype", &subtype), let subtypeName = subtype else { return true }
    if String(cString: subtypeName) == "Image" {
        imageKeys.append(String(cString: key))
    }
    return true
}, nil)

let allPageImages = imageKeys.compactMap { imageKey -> UIImage? in
    var stream: CGPDFStreamRef?
    guard CGPDFDictionaryGetStream(xObject, imageKey, &stream), let imageStream = stream else {
        print("Couldn't get image stream.")
        return nil
    }
    var format: CGPDFDataFormat = .raw
    guard let data = CGPDFStreamCopyData(imageStream, &format) else {
        print("Couldn't convert image stream to data.")
        return nil
    }
    guard let image = UIImage(data: data as Data) else {
        print("Couldn't convert image data to image.")
        return nil
    }
    return image
}

Note that CGPDFDictionaryApplyBlock(opens in a new tab) is only available from iOS 12 on. For previous versions, you can instead use CGPDFDictionaryApplyFunction(opens in a new tab), which requires a bit more overhead than its newer counterpart, but works just as fine.

Using PDFKit

While most of the above logic is not possible in PDFKit, as it doesn’t provide APIs that are low-level enough to extract images, you can at least query the CGPDFPage(opens in a new tab) instance via the pageRef(opens in a new tab) property on PDFPage(opens in a new tab), which in turn can be accessed via page(at:)(opens in a new tab) from the PDFDocument(opens in a new tab). This will enable you to execute the logic mentioned in the section above.

Using PSPDFKit

PSPDFKit for iOS has APIs available for extracting objects in a PDF without requiring any knowledge about the internal structure of the PDF. This includes support for extracting embedded images of a page. To retrieve the images, you first need to get the PSPDFTextParser(opens in a new tab) for the appropriate page where the image you would like to extract is located. The text parser already has a ready-to-use images(opens in a new tab) property that returns PSPDFImageInfo(opens in a new tab) values for all the available images on the page. Note that, for performance reasons, this property does not directly return all UIImage objects, since querying these will require additional parsing of the PDF, and in some cases, getting only the metadata for these images is already enough. But in our case, we are fine with further parsing of the PDF to extract the actual image data, so we call imageInRGBColorSpace()(opens in a new tab), which extracts the image data from the page and creates a UIImage representation of the data.

Extracting all the images on a single PDF page can be done like this:

func getEmbeddedImages(ofPDFAt url: URL, pageIndex: PageIndex) -> [UIImage] {
    let document = PSPDFDocument(url: url)
    guard let textParser = document.textParserForPage(at: pageIndex) else {
        print("Failed to get text parser.")
        return []
    }
    return textParser.images.compactMap { imageInfo  in
        return try? imageInfo.imageInRGBColorSpace()
    }
}

Conclusion

In this post, we explored two methods of how embedded images can be obtained from a PDF document in Swift. While extracting images from a PDF is supported via CGPDF APIs that are already available by default on iOS, implementing the actual logic to handle the extraction is not that straightforward, and this requires quite a bit of knowledge about the underlying structure of a PDF. So if you don’t want to worry about all these technical details, PSPDFKit can help. With our APIs, retrieving embedded images from a PDF page is simply a matter of a few lines of code.