Create a Document from a URL
If you already have a data store for your files, or prefer not to store them with Document Engine, you can create a document from a URL.
When operating on a document created this way, Document Engine fetches the file from the provided URL and caches it in the node's file system.
- Your service sends a document’s URL to Document Engine, which makes a request to the URL to retrieve the document.
- The document service returns the document, and Document Engine saves it and its metadata in the asset storage and PostgreSQL.
- Your service receives the document ID back, which it can use to reference the document later.
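The steps above can be sketched from the calling service's side as follows. This is a minimal illustration, not a definitive client: the `/api/documents` endpoint, the `Token token=...` authorization header, the `url` request field, and the `data.document_id` response field are assumptions about the HTTP API and should be checked against the API reference for your Document Engine version. The base URL, token, and function names are hypothetical.

```python
import json
import urllib.request


def build_create_request(base_url: str, api_token: str, document_url: str) -> urllib.request.Request:
    """Build the request that asks Document Engine to create a document from a URL.

    Assumed, not confirmed by this guide: the endpoint path, the header
    format, and the name of the "url" field in the JSON body.
    """
    body = json.dumps({"url": document_url}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/documents",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token token={api_token}",
            "Content-Type": "application/json",
        },
    )


def create_document_from_url(base_url: str, api_token: str, document_url: str) -> str:
    """Send the request and return the document ID for later reference."""
    request = build_create_request(base_url, api_token, document_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Assumed response shape: {"data": {"document_id": "..."}}
    return payload["data"]["document_id"]
```

Your service would store the returned ID and use it in later calls that reference the document. Building the request in a separate function keeps the network call easy to replace in tests.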
Security Considerations
Creating a document from a URL requires Document Engine to perform a server-side retrieval of the data at the specified URL. As such, this feature comes with inherent security risks.
For increased security at the expense of ease of integration, consider disabling document creation from a URL and using the document creation from upload architecture instead, so that the data Document Engine processes is sourced from a PostgreSQL database or S3-compatible storage.
Document creation from a URL is designed to let Document Engine integrate easily with other known and trusted services by retrieving data from them directly via HTTPS.
Never send an untrusted URL directly to Document Engine. If your service works with user input or other untrusted data sources, it must implement checks that prevent untrusted URLs from reaching Document Engine, which mitigates the risk of server-side request forgery (SSRF).
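One way to implement such a check is an allowlist that your service applies before forwarding any URL. The sketch below is illustrative only: the `TRUSTED_HOSTS` set, the `files.example.com` host, and the `is_trusted_url` function are hypothetical names, and the specific rules (HTTPS only, no embedded credentials, default port only) are assumptions you should adapt to your own trusted sources.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts your service trusts to serve documents.
TRUSTED_HOSTS = frozenset({"files.example.com"})


def is_trusted_url(url: str, trusted_hosts: frozenset = TRUSTED_HOSTS) -> bool:
    """Return True only for HTTPS URLs that point at an allowlisted host."""
    try:
        parts = urlsplit(url)
        hostname = parts.hostname  # already lowercased by urlsplit
        port = parts.port          # raises ValueError for a malformed port
    except ValueError:
        return False
    if parts.scheme != "https":
        return False
    # Reject embedded credentials such as https://trusted-host@attacker/...
    if parts.username is not None or parts.password is not None:
        return False
    # Allow only the default HTTPS port.
    if port not in (None, 443):
        return False
    return hostname is not None and hostname in trusted_hosts
```

A service would call `is_trusted_url` on every candidate URL and refuse to forward any URL for which it returns `False`. A string-level check like this does not account for redirects issued by an allowed host or for DNS changes, so it complements network-level restrictions rather than replacing them.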
Because document creation from a URL requires an authorized API call, Document Engine doesn’t place limitations on the network sources it’ll attempt to resolve and retrieve data from. In keeping with the principle of least privilege, consider limiting outbound network traffic at the container or network firewall level so that Document Engine can only connect to the sources it’s expected to retrieve data from.