Asset Storage Configuration

Document Engine supports multiple storage backends for PDFs and other assets, as detailed below.

Built-In Asset Storage

By default, Document Engine stores assets as Binary Large Objects (BLOBs) in the database. For production environments, especially if you store larger individual PDFs, we recommend using object storage. We currently support Amazon S3-compatible object storage and Azure Blob Storage.

Set ASSET_STORAGE_BACKEND to built-in to use the built-in asset storage. When deploying with Helm, use the pspdfkit.storage.assetStorageBackend value.
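
For example, with Docker Compose you can select the built-in backend directly in the environment block (a minimal sketch; the surrounding service definition is omitted):

environment:
  ASSET_STORAGE_BACKEND: built-in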

S3-Compatible Object Storage

Document Engine can also store your assets in any Amazon S3-compatible object storage service.

Configuration

Set ASSET_STORAGE_BACKEND to S3. Other configuration options depend on whether you’re using AWS S3 or another S3-compatible storage provider.

Here are the available parameters as Helm values:

assetStorage:
  # `ASSET_STORAGE_BACKEND`: `built-in`, `s3` or `azure`
  assetStorageBackend: s3
  # S3 backend storage settings, in case `pspdfkit.storage.assetStorageBackend` is set to `s3`
  s3:
    # `ASSET_STORAGE_S3_ACCESS_KEY_ID`
    accessKeyId: "<...>"
    # `ASSET_STORAGE_S3_SECRET_ACCESS_KEY`
    secretAccessKey: "<...>"
    # `ASSET_STORAGE_S3_BUCKET`
    bucket: "<...>"
    # `ASSET_STORAGE_S3_REGION`
    region: "<...>"
    # `ASSET_STORAGE_S3_HOST`
    #host: "os.local"
    # `ASSET_STORAGE_S3_PORT`
    port: 443
    # `ASSET_STORAGE_S3_SCHEME`
    #scheme: "https://"
    # External secret name
    #externalSecretName: ""

AWS S3

When using S3, you must set the ASSET_STORAGE_S3_BUCKET and ASSET_STORAGE_S3_REGION configuration options to specify the bucket name and region.

If you’re running on AWS, Document Engine will try to resolve access credentials with the following precedence:

  • The ASSET_STORAGE_S3_ACCESS_KEY_ID and ASSET_STORAGE_S3_SECRET_ACCESS_KEY configuration options.

  • Role-based credentials provided by the platform, such as an ECS task role or an EC2 instance role.

We don’t recommend using credentials directly. Instead, consider using role-based permission management, depending on the underlying platform.

If you’re not running on AWS, you must always set ASSET_STORAGE_S3_ACCESS_KEY_ID and ASSET_STORAGE_S3_SECRET_ACCESS_KEY.

Other S3-Compatible Storage Providers

When using an object storage provider other than Amazon S3, you must always set ASSET_STORAGE_S3_ACCESS_KEY_ID and ASSET_STORAGE_S3_SECRET_ACCESS_KEY. In addition, you can configure the following options:

  • ASSET_STORAGE_S3_HOST — Host name of the storage service.

  • ASSET_STORAGE_S3_PORT — Port used to access the storage service. The default port is 443.

  • ASSET_STORAGE_S3_SCHEME — The URL scheme used when accessing the service, either http:// or https://. The default is https://.
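
For example, a Docker Compose environment for a self-hosted S3-compatible service could look like this (a sketch; the host, credentials, and bucket are placeholders):

environment:
  ASSET_STORAGE_BACKEND: S3
  ASSET_STORAGE_S3_ACCESS_KEY_ID: <access key>
  ASSET_STORAGE_S3_SECRET_ACCESS_KEY: <secret key>
  ASSET_STORAGE_S3_BUCKET: <bucket name>
  ASSET_STORAGE_S3_REGION: us-east-1
  ASSET_STORAGE_S3_HOST: storage.internal.example
  ASSET_STORAGE_S3_PORT: 443
  ASSET_STORAGE_S3_SCHEME: https://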

For more details about using Google Cloud Storage as the storage backend, take a look at the Google Cloud Storage interoperability guide.

Bucket and Key Policy

If you’re using AWS S3, the IAM identity used by Document Engine needs the following permissions:

  • s3:ListBucket on the configured bucket

  • s3:PutObject on all objects in the bucket (<bucket-arn>/*)

  • s3:GetObjectAcl on all objects in the bucket (<bucket-arn>/*)

  • s3:GetObject on all objects in the bucket (<bucket-arn>/*)

  • s3:DeleteObject on all objects in the bucket (<bucket-arn>/*)

If you’re using server-side encryption with AWS Key Management Service (SSE-KMS), the following actions must be allowed on the encryption key:

  • kms:Decrypt

  • kms:Encrypt

  • kms:GenerateDataKey
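
Taken together, these permissions correspond to an IAM policy along the following lines (a minimal sketch; replace the bucket name and key ARN with your own, and omit the KMS statement if you don’t use SSE-KMS):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::<bucket-name>"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::<bucket-name>/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:GenerateDataKey"
      ],
      "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
    }
  ]
}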

Timeouts

All operations on the S3 bucket have a timeout of 30 seconds.

Azure Blob Storage

Document Engine can store your assets in Azure Blob Storage.

Configuration

To configure Azure Blob Storage as the default asset store, set ASSET_STORAGE_BACKEND to azure in your Document Engine configuration. You also need to provide the following configuration options:

  • AZURE_STORAGE_ACCOUNT_NAME

  • AZURE_STORAGE_ACCOUNT_KEY

  • AZURE_STORAGE_DEFAULT_CONTAINER

Alternatively, instead of providing AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY, you can supply a connection string by setting AZURE_STORAGE_ACCOUNT_CONNECTION_STRING.

Here they are as Helm values:

assetStorage:
  # `ASSET_STORAGE_BACKEND`: `built-in`, `s3` or `azure`
  assetStorageBackend: azure
  # Azure Blob Storage settings, in case `pspdfkit.storage.assetStorageBackend` is set to `azure`
  azure:
    # `AZURE_STORAGE_ACCOUNT_NAME`
    accountName: "<...>"
    # `AZURE_STORAGE_ACCOUNT_KEY`
    accountKey: "<...>"
    # `AZURE_STORAGE_DEFAULT_CONTAINER`
    container: "<...>"
    # `AZURE_STORAGE_ACCOUNT_CONNECTION_STRING`, takes priority over `accountName` and `accountKey`
    #connectionString: ""
    # `AZURE_STORAGE_API_URL` for custom endpoints
    #apiUrl: ""
    # External secret name
    #externalSecretName: ""

In development and test environments, we recommend using Azurite when ASSET_STORAGE_BACKEND is set to azure. When using Azurite, you can configure the URL for the Azure Blob Storage service by setting AZURE_STORAGE_API_URL to the address of the Azurite deployment.

Which Storage Backend Should I Use?

The choice of storage backend depends on the PDF dataset that will power your application, and it impacts the general performance of Document Engine.

If you have a relatively stable set of PDF files (i.e. one that only changes a few times a month), each smaller than 5 MB, you can safely use the built-in storage. The main advantages are:

  • You don’t have to worry about another piece of infrastructure.

  • Backing up the Document Engine PostgreSQL instance will also back up your assets.

For larger and more frequently changing files, we recommend using the S3-compatible asset storage backend, which provides more efficient support for concurrent uploads and downloads.

Using the S3-compatible backend means you need a separate backup routine, but you should consider that:

  • As Document Engine stores files by their SHA checksums, a daily incremental backup will usually suffice.

  • Unless you use a backup solution that orchestrates a point-in-time backup across different storage types (e.g. AWS Backup), schedule the asset storage backup right after the PostgreSQL database backup to avoid data drift between the two.

Serving Files from Existing Storage in Your Infrastructure

If you already have a storage solution for PDF files in your infrastructure, Document Engine can integrate with it as long as the PDF files can be accessed via an HTTP endpoint. When integrating Document Engine with your file storage, you’ll need to add documents from a URL.
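
Adding a document by URL is a regular document-creation request with a url field instead of a file upload (a sketch, assuming Document Engine runs on http://localhost:5000; the file URL is a placeholder):

curl -X POST http://localhost:5000/api/documents \
    -H "Authorization: Token token=secret" \
    -H "Content-Type: application/json" \
    -d '{
          "url": "https://files.internal.example/document.pdf"
        }'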

All PDF URLs should be considered permalinks, as PSPDFKit will always fetch the file when needed (keeping only a local cached copy that can expire at any time).

Information

Never accept arbitrary user input as a URL for a PDF. Malicious users might leverage this to make Document Engine perform a request on their behalf. This kind of attack, known as Server-Side Request Forgery (SSRF), can be used to interact with services that assume the local network is secure, e.g. cloud automation infrastructure.

To achieve the best possible performance, ensure Document Engine instances and the file store sit in the same network (physical or virtual). This minimizes latency and maximizes download speed.

As of version 2019.4, it’s possible to perform a document editing operation on a document with a remote URL, but the resulting PDF file will need to be stored with any of the supported storage strategies. If you need to copy the transformed file back to the file store, you’ll need to do that manually by fetching the transformed file first.

If your file store requires authentication, we recommend introducing an internal proxy. When adding a document with a URL, the URL would point to the proxy endpoint, where your custom logic would be able to support the required authentication options and redirect to the file store URL of the PDF file. For more information and some sample code, visit the relevant guide article.

Migration between Asset Storage Options

It’s possible to migrate from one storage backend to another by executing the migration command as described below. To prevent data loss, a migration doesn’t delete files from the original storage backend.

Asset storage backend migrations are incremental. You can interrupt the migration process at any time and resume it later. This is useful when you have many documents and want to run the migration only during periods of low load. You can perform the migration while Document Engine is running.

Before you start the migration process, make sure to set the ENABLE_ASSET_STORAGE_FALLBACK configuration option to true and to specify the storage fallbacks you want enabled. This will allow Document Engine to serve assets that haven’t yet been migrated from the old storage backend.

Remember to set it back to false once you’ve finished migrating all the documents, as fallback lookups introduce a slight performance penalty when fetching assets.

At any point, you can inspect how many documents are stored in each asset storage backend from the Storage tab in the Document Engine dashboard.

All configuration options mentioned in this section are also configurable in the Helm chart values.

Migrating to S3 from Built-In Storage

To migrate from the built-in asset storage to S3, follow these steps:

  1. Set the ENABLE_ASSET_STORAGE_FALLBACK configuration option to true.

  2. Enable the built-in database storage as a fallback by setting ENABLE_ASSET_STORAGE_FALLBACK_POSTGRES to true.

  3. Set the ASSET_STORAGE_BACKEND configuration option to s3 and configure the rest of the S3 options.

  4. Run the migration script by executing the pspdfkit assets:migrate:from-built-in-to-s3 command in the Document Engine container.

  • If you use docker-compose, run the following command in the directory where you have your docker-compose.yml file: docker-compose run pspdfkit pspdfkit assets:migrate:from-built-in-to-s3.

  • If you don’t use docker-compose, first find the name of the Document Engine container using docker ps -a. This will list all running containers and their names. Then, run the following command, replacing <container name> with the actual Document Engine container name: docker exec <container name> pspdfkit assets:migrate:from-built-in-to-s3.

  5. When all your documents have been migrated, set the ENABLE_ASSET_STORAGE_FALLBACK option back to false.
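
Combining steps 1–3, the Document Engine environment during the migration window might look like this (a sketch; the S3 values are placeholders):

environment:
  ENABLE_ASSET_STORAGE_FALLBACK: true
  ENABLE_ASSET_STORAGE_FALLBACK_POSTGRES: true
  ASSET_STORAGE_BACKEND: s3
  ASSET_STORAGE_S3_BUCKET: <bucket name>
  ASSET_STORAGE_S3_REGION: <region>
  ASSET_STORAGE_S3_ACCESS_KEY_ID: <access key>
  ASSET_STORAGE_S3_SECRET_ACCESS_KEY: <secret key>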

Migrating to Built-In Storage from S3

To migrate from the S3 asset storage to the built-in storage, follow these steps:

  1. Set the ENABLE_ASSET_STORAGE_FALLBACK configuration option to true.

  2. Enable the S3 asset storage as a fallback by setting ENABLE_ASSET_STORAGE_FALLBACK_S3 to true.

  3. Set the ASSET_STORAGE_BACKEND configuration option to built-in. Do not remove any of the S3 configuration options.

  4. Run the migration script by executing the pspdfkit assets:migrate:from-s3-to-built-in command in the Document Engine container.

  • If you use docker-compose, run the following command in the directory where you have your docker-compose.yml file: docker-compose run pspdfkit pspdfkit assets:migrate:from-s3-to-built-in.

  • If you don’t use docker-compose, first find the name of the Document Engine container using docker ps -a. This will list all running containers and their names. Then, run the following command, replacing <container name> with the actual Document Engine container name: docker exec <container name> pspdfkit assets:migrate:from-s3-to-built-in.

  5. When all your documents have been migrated, set the ENABLE_ASSET_STORAGE_FALLBACK option back to false and remove all the S3 configuration options.

Migrating to and from Azure Blob Storage

We currently don’t support batch migrations of assets to or from Azure Blob Storage. That said, you can still migrate an individual document’s assets to or from Azure. Learn more about this here.

Per-Document Storage

In addition to configuring a default storage backend for all documents by setting the ASSET_STORAGE_BACKEND variable, you can upload documents to specific storage backends so long as those backends are enabled as fallbacks in your Document Engine configuration.

Enabling Fallbacks for Asset Storage

To use multiple asset stores in your Document Engine instance, you can configure the main asset store by setting ASSET_STORAGE_BACKEND to built-in, azure, or s3.

Once configured, the storage backend set in ASSET_STORAGE_BACKEND will be used as the default storage for all documents. To store a specific document in a different asset storage backend than the configured default, the other asset storage needs to be enabled as a fallback.

For example, if ASSET_STORAGE_BACKEND is set to azure, then by default, all documents and their assets will be stored in Azure Blob Storage using the configured Azure credentials. However, you can configure a specific document to be stored in AWS S3. You can set this when uploading the document. To do this, S3 needs to be enabled as a fallback asset store.

To enable fallback asset storage, you need to set ENABLE_ASSET_STORAGE_FALLBACK to true. After that, enable the specific fallbacks you want by setting any of the following to true:

  • ENABLE_ASSET_STORAGE_FALLBACK_POSTGRES

  • ENABLE_ASSET_STORAGE_FALLBACK_S3

  • ENABLE_ASSET_STORAGE_FALLBACK_AZURE

In addition to enabling the specific fallback, you need to also set any relevant configuration options for all the storage backends you have enabled. For example, if you enable S3 as an asset fallback, you need to provide the relevant configuration for S3, including the default S3 bucket.
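
For example, to use Azure Blob Storage as the default while keeping S3 available as a per-document fallback, the environment might look like this (a sketch; all values are placeholders):

environment:
  ASSET_STORAGE_BACKEND: azure
  AZURE_STORAGE_ACCOUNT_NAME: <account name>
  AZURE_STORAGE_ACCOUNT_KEY: <account key>
  AZURE_STORAGE_DEFAULT_CONTAINER: <container>
  ENABLE_ASSET_STORAGE_FALLBACK: true
  ENABLE_ASSET_STORAGE_FALLBACK_S3: true
  ASSET_STORAGE_S3_BUCKET: <bucket name>
  ASSET_STORAGE_S3_REGION: <region>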

Information

Enabling and using fallback storage backends introduces a slight decrease in performance when fetching assets.

Uploading Documents to Different Storage

You can specify the storage option when uploading a document to Document Engine. This way, documents can be stored in different storage backends — as long as the storage backend is enabled as either the default storage or as a fallback. Learn more about the various options you can set when uploading a document from our API Reference.

Here’s an example of a request uploading a document and specifying the S3 bucket to use for that document:

# With Document Engine running on `http://localhost:5000`.

curl -X POST http://localhost:5000/api/documents \
    -H "Authorization: Token token=secret" \
    -H "Content-Type: multipart/form-data" \
    -F 'file=@document.pdf' \
    -F 'storage={
          "backend": "s3",
          "bucketName": "a-different-bucket-from-default-s3-bucket",
          "bucketRegion": "us-west-2"
        }'

Migrating a Document’s Assets to Different Storage

You can migrate all the assets associated with a document (PDFs, images, file attachments, etc.) and all its layers to another storage backend by making a request to /api/documents/{documentId}/migrate_assets.

Here’s an example curl request to migrate a document’s assets to the built-in storage:

# With Document Engine running on `http://localhost:5000`.

curl -X POST http://localhost:5000/api/documents/{documentId}/migrate_assets \
    -H "Authorization: Token token=secret" \
    -H "Content-Type: application/json" \
    -d '{
          "storage": {
            "backend": "built-in"
          }
        }'

Learn more about migrating assets from our API Reference.

Multiple S3 Buckets

Documents can be uploaded (or migrated) to many different S3 buckets, as long as the instance role associated with your Document Engine nodes (or the AWS credentials configured for Document Engine) has the required permissions for all the buckets you intend to use.

Information

This feature is currently only available for S3. With Azure Blob Storage, all documents need to be stored in the default configured storage account, AZURE_STORAGE_ACCOUNT_NAME.

MinIO

With Helm

The Document Engine Helm chart has an optional dependency on MinIO, an S3-compatible object storage implementation. To enable it, use the following values:

assetStorage:
  assetStorageBackend: s3
  s3:
    accessKeyId: "pspdfkitObjectStorageRootKey"
    secretAccessKey: "pspdfkitObjectStorageRootPassword"
    bucket: "document-engine-assets"
    region: "us-east-1"
    host: "minio"
    port: 9000
    scheme: "http://"
minio:
  enabled: true
  fullnameOverride: minio
  nameOverride: minio
  auth:
    rootUser: pspdfkitObjectStorageRootKey
    rootPassword: pspdfkitObjectStorageRootPassword
  defaultBuckets: "document-engine-assets"

With Docker Compose

To run the MinIO Docker container, use the following command:

docker pull minio/minio
docker run -p 9000:9000 minio/minio server /export

After running these commands, you’ll see the AccessKey and SecretKey printed out in the terminal, which you can use to access the MinIO web interface at http://localhost:9000/minio.
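
The bucket itself must exist before Document Engine can use it. Besides the web interface, you can create it with the MinIO client if you have mc installed locally (a sketch; the key, secret, and bucket name are placeholders):

mc alias set local http://localhost:9000 <access key> <secret key>
mc mb local/<bucket name>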

You can now configure docker-compose.yml, like this:

environment:
	ASSET_STORAGE_BACKEND: S3
	ASSET_STORAGE_S3_BUCKET: <minio bucket name>
	ASSET_STORAGE_S3_ACCESS_KEY_ID: <minio access key>
	ASSET_STORAGE_S3_SECRET_ACCESS_KEY: <minio secret access key>
	ASSET_STORAGE_S3_SCHEME: http://
	ASSET_STORAGE_S3_HOST: pssync_minio
	ASSET_STORAGE_S3_PORT: 9000

MinIO supports emulating different regions. It defaults to us-east-1. If you’ve changed your MinIO configuration to a different region, make sure to set ASSET_STORAGE_S3_REGION accordingly.

Azurite

Azurite is an open source emulator from Microsoft for testing Azure Blob Storage operations in development and test environments. If you use Azure Blob Storage in production, we recommend using Azurite in development to get closer to dev/prod parity.

To run Azurite in Docker, run the following commands:

docker pull mcr.microsoft.com/azure-storage/azurite
docker run -p 10000:10000 mcr.microsoft.com/azure-storage/azurite

You can then configure Document Engine to use the default storage account on the Azurite instance. Learn more about the default storage account from Microsoft here.

In your Docker Compose file, for example, you can have this:

environment:
	AZURE_STORAGE_ACCOUNT_NAME: devstoreaccount1
	AZURE_STORAGE_ACCOUNT_KEY: Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==
	AZURE_STORAGE_DEFAULT_CONTAINER: pspdfkit-dev
	AZURE_STORAGE_API_URL: http://localhost:10000/devstoreaccount1
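
Alternatively, as described above, you can replace the account name and key with a single connection string. For Azurite’s default storage account, that would look like this (a sketch using Azurite’s well-known development credentials):

environment:
  AZURE_STORAGE_DEFAULT_CONTAINER: pspdfkit-dev
  AZURE_STORAGE_ACCOUNT_CONNECTION_STRING: "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://localhost:10000/devstoreaccount1;"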