Migrate documents from Amazon S3

Migrate documents stored on Amazon S3 into Document Engine using the remote document URL API. This enables Document Engine to fetch documents directly from S3 as needed, eliminating additional storage costs.

To ensure this works correctly, avoid using signed URLs, since the URL must remain valid for the document’s entire lifespan. Additionally, the file retrieved from the URL must match the original file uploaded. If a document’s URL becomes invalid, you must delete and reupload the document, as updating the URL for an existing document isn’t possible.

If you currently use signed URLs, consider the approaches outlined below.

Changing the S3 bucket policy

If you use S3 with signed URLs, we recommend modifying the S3 bucket policy. Instead of providing Document Engine with signed URLs that include tokens and expiration dates, allow access from the IP address of the host where Document Engine runs. This removes the need for signed URLs while keeping your documents secure, as Document Engine doesn’t expose these URLs.
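As a rough illustration of the policy change described above, the following sketch builds a bucket policy that allows object reads only from a given source IP address. The function names, the statement ID, the bucket name, and the IP address are placeholders chosen for this example and don't come from Document Engine. The sketch also assumes Document Engine reaches S3 over a public IP address; requests that arrive through a VPC endpoint need a different condition key than aws:SourceIp.

```javascript
// Builds an S3 bucket policy document that grants read access to objects
// only when the request originates from the given IP address or CIDR range.
// All argument values are placeholders for illustration.
function buildIpRestrictedBucketPolicy(bucketName, sourceIp) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "AllowDocumentEngineRead", // hypothetical statement ID
        Effect: "Allow",
        Principal: "*",
        Action: "s3:GetObject",
        Resource: `arn:aws:s3:::${bucketName}/*`,
        Condition: {
          IpAddress: { "aws:SourceIp": sourceIp },
        },
      },
    ],
  };
}

// Serializes the policy so it can be pasted into the bucket's permissions settings.
function policyToJson(bucketName, sourceIp) {
  return JSON.stringify(buildIpRestrictedBucketPolicy(bucketName, sourceIp), null, 2);
}
```

For example, policyToJson("my-bucket", "203.0.113.10/32") produces JSON that could be applied as the bucket policy, after which Document Engine can be given plain object URLs without signature parameters.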

Using an internal endpoint for redirection

In some cases, the best way to work with assets in Document Engine is to provide URLs that redirect to the final asset location. This approach is useful in the following scenarios:

Presigned S3 URLs are the only option for allowing Document Engine to access documents, especially if modifying S3 bucket policies isn’t feasible or doesn’t work.
You need to add credentials to the URLs provided to Document Engine.

Both scenarios involve signed URLs, which present the following challenges:

Presigned S3 URLs expire after a set time, meaning a document may become inaccessible when Document Engine tries to retrieve it later.
Other signed URLs may contain sensitive credentials that shouldn’t be stored in Document Engine’s database.

To address these issues, you can create an internal endpoint in your backend or a sidecar service. This endpoint can be a Lambda function, a small service, or an additional route within your backend.

The internal endpoint will generate and redirect requests to signed URLs. Instead of providing signed URLs directly to Document Engine, provide URLs pointing to this internal endpoint, which will handle redirection to the signed URLs where Document Engine can retrieve the assets.

Example using a Node.js server to generate and redirect signed URLs

The following code snippet demonstrates how to set up a Node.js server that generates signed URLs and redirects requests to them. The server listens at /documents/:documentId, where :documentId represents the document identifier used in your application.

For example, you can provide Document Engine with the URL https://yourapp.com/documents/myDocumentIdentifier1, and this endpoint will redirect to a signed URL, such as https://s3.amazonaws.com/myDocument.pdf?token=123455.

// Load the AWS SDK modules once, outside the request handler.
const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");
const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");

// Reuse a single S3 client across requests.
const client = new S3Client(clientParams);

// Catch the document endpoint. `app` is assumed to be an Express application.
// The handler is async because generating the signed URL is awaited.
app.get("/documents/:documentId", async function (req, res) {
  try {
    // Generate the signed Amazon S3 URL for the unique document ID
    // that Document Engine requested.
    // See https://docs.aws.amazon.com/AmazonS3/latest/API/s3_example_s3_Scenario_PresignedUrl_section.html

    // ... use req.params.documentId to construct the S3 command
    const command = new GetObjectCommand(getObjectParams);
    const preSignedUrl = await getSignedUrl(client, command, { expiresIn: 3600 });

    // Respond with a redirect to the signed URL.
    res.redirect(preSignedUrl);
  } catch (error) {
    // Signing failed, so report an error instead of leaving the request pending.
    res.status(500).send("Unable to generate a signed URL");
  }
});

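The getSignedUrl function generates the signed S3 URL. Its expiresIn option (3600 seconds in this example) limits how long each generated URL remains valid. Because the endpoint creates a fresh signed URL on every request, the URL stored in Document Engine never expires. This implementation is similar to the approach outlined in the Java example from the AWS guides.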

Uploading documents to Document Engine

You can either upload documents on demand or migrate all documents at once to Document Engine.

Uploading all documents

To upload all your documents in one go, use the API to add a document from a URL. Call this API once per document URL to register each document with Document Engine. If your application already assigns unique IDs to documents, you can include them in the request to maintain consistency across systems:

POST /api/documents
Content-Type: application/json
Authorization: Token token="<secret token>"

{
  "url": "http://file.example.com/sample.pdf",
  "document_id": "my_document_id_1"
}

curl http://127.0.0.1:5000/api/documents \
    -X POST \
    -H "Authorization: Token token=<secret token>" \
    -H "Content-type: application/json" \
    -d '{"url": "http://file.example.com/sample.pdf", "document_id": "my_document_id_1"}'
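The per-document calls described above can be scripted. The following is a minimal sketch, assuming Node.js 18 or later (for the built-in fetch) and a list of documents your application already knows about. The names buildAddDocumentRequest, registerAllDocuments, and the shape of the documents array are hypothetical; only the endpoint, headers, and JSON fields come from the request shown above.

```javascript
// Builds the fetch options for the "add a document from a URL" request.
function buildAddDocumentRequest(doc, token) {
  return {
    method: "POST",
    headers: {
      Authorization: `Token token=${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url: doc.url, document_id: doc.documentId }),
  };
}

// Registers each document in sequence and collects failures instead of
// stopping at the first one. `fetchImpl` defaults to the global fetch and
// can be swapped out, for example in tests.
async function registerAllDocuments(
  documents,
  { baseUrl, token, fetchImpl = globalThis.fetch }
) {
  const failures = [];
  for (const doc of documents) {
    const response = await fetchImpl(
      `${baseUrl}/api/documents`,
      buildAddDocumentRequest(doc, token)
    );
    if (!response.ok) {
      failures.push({ documentId: doc.documentId, status: response.status });
    }
  }
  return failures;
}
```

A call such as registerAllDocuments(documents, { baseUrl: "http://127.0.0.1:5000", token: "<secret token>" }) resolves to the list of documents that couldn't be registered, which you can retry or log. Requests run one at a time here to keep the load on Document Engine predictable; batching them in parallel is a possible variation.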

Uploading documents on demand

When a user requests a document, first check if it already exists in Document Engine. If the document is available, serve it immediately. Otherwise, upload it using the same ID the user provided.

For example, if your application has a route like /documents/:id and a user requests my_document_id_1, you can check if the document exists using the document info endpoint:

GET /api/documents/my_document_id_1/document_info

If the document doesn’t exist, the request returns a 404 error. In that case, upload the document using the endpoint for adding a document from a URL:

POST /api/documents
Content-Type: application/json
Authorization: Token token="<secret token>"

{
  "url": "https://s3.amazonaws.com/my-bucket/sample.pdf",
  "document_id": "my_document_id_1"
}

curl http://127.0.0.1:5000/api/documents \
    -X POST \
    -H "Authorization: Token token=<secret token>" \
    -H "Content-type: application/json" \
    -d '{"url": "https://s3.amazonaws.com/my-bucket/sample.pdf", "document_id": "my_document_id_1"}'
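The check-then-upload flow described in this section can be combined into one helper. The following is a minimal sketch, assuming Node.js 18 or later for the built-in fetch. The function name ensureDocument, its return values, and its error handling are assumptions made for this example; the two endpoints and the 404 behavior are the ones described above.

```javascript
// Makes sure a document is registered in Document Engine before it is served.
// Resolves to "exists" if the document was already there, or "uploaded" if it
// had to be added from `sourceUrl`. Any other outcome throws.
async function ensureDocument(
  documentId,
  sourceUrl,
  { baseUrl, token, fetchImpl = globalThis.fetch }
) {
  const headers = { Authorization: `Token token=${token}` };

  // Step 1: ask the document info endpoint whether the document exists.
  const infoResponse = await fetchImpl(
    `${baseUrl}/api/documents/${encodeURIComponent(documentId)}/document_info`,
    { method: "GET", headers }
  );
  if (infoResponse.ok) {
    return "exists";
  }
  if (infoResponse.status !== 404) {
    throw new Error(`Unexpected status ${infoResponse.status} from document_info`);
  }

  // Step 2: a 404 means the document is missing, so add it from its URL
  // using the same ID.
  const uploadResponse = await fetchImpl(`${baseUrl}/api/documents`, {
    method: "POST",
    headers: { ...headers, "Content-Type": "application/json" },
    body: JSON.stringify({ url: sourceUrl, document_id: documentId }),
  });
  if (!uploadResponse.ok) {
    throw new Error(`Upload failed with status ${uploadResponse.status}`);
  }
  return "uploaded";
}
```

A route such as /documents/:id could await ensureDocument(id, sourceUrl, options) before serving the document. Treating only 404 as "missing" keeps authentication or server errors from triggering an unintended upload.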