OCR PDFs and Images with Nintex Workflows

In this guide you’ll learn how to OCR a PDF document in SharePoint using a Designer Workflow. This enables you to process scanned or bitmap based content and generate fully searchable PDFs. We’ve also added the ability to recognise text on (part of) a page and return the actual text (not a bitmap) to the workflow for further processing. A common use for this functionality is to extract a particular area of text from documents that share a common template or layout.

Most organizations deal with scanned (or other bitmap based) content on a regular basis. Faxes are received in a digital inbox, and invoices or legal documents are scanned and filed away in a file system, an MS SharePoint library, or another Document Management System. The problem is that this is ‘dead information’: the content is stored as one big image, which search crawlers cannot index, so it does not show up in search results.

This is where OCR (Optical Character Recognition) comes in. OCR analyzes image based content, e.g. a scanned PDF or an image embedded in an MS Word file, applies image recognition logic, and embeds the recognised text in a PDF. The scanned content still looks the same, but you can now copy text from the document, and search crawlers can index this text as well.

[Figure: Scanned document with OCRed text selected]

It is possible to carry out OCR using our standard Convert Document workflow activity, but that requires knowledge of our XML syntax, which - although powerful - is less than user friendly. To make life easier we have created a separate Workflow Activity named Convert to OCRed PDF. This is what it looks like.

[Figure: The Convert to OCRed PDF workflow activity]

The workflow activity added is consistent with our other Workflow Activities (e.g. Converting / Watermarking / Merging / Securing) and largely self-describing.

  1. this document: The source document to Convert and OCR. For most workflows selecting Current Item will suffice, but some scenarios may require the lookup of a different item.

  2. this file: The name and location to write the generated file to. Leave this field empty to use the same location and name as the source file. Please note that if your source file is already in PDF format then leaving this field empty will overwrite it. For details about how to specify paths to different libraries / site collections see this blog post.

  3. include / exclude metadata: Control if the source file’s SharePoint metadata is copied to the destination file.

  4. OCR language: The language the source document is written in. It defaults to English, but from version 7.2 onwards we support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.

  5. OCR Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.

  6. Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.

  7. Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.

  8. Regions: By default the entire page is OCRed. To limit OCR to certain parts of a page, e.g. a header and/or footer, you can specify one or more regions using our XML syntax. Have a look at this blog post, but only use the part that starts with (and includes) <Regions>…</Regions>.

  9. List ID: The ID of the list the processed file was written to. This can be used later in the workflow to perform additional tasks on the file such as a check-in or out.

  10. Item ID: The ID of the processed file. This can be used together with the List ID to perform additional tasks on the file later in the workflow.

Although creating simple workflows in SharePoint Designer is relatively easy, if the concept of MS SharePoint Designer workflows is new to you then have a look at our simple Getting Started Knowledge Base article.

Note: The OCR and PDF/A Archiving Add-On license is needed in order to use OCR in your production environment.


Information

We recently released the Muhimbi Document Converter Xtension for Nintex Automation Cloud. You can download it here or learn more about available Muhimbi deployments for Nintex on our product page.

We’ve been working with Nintex Workflow ever since we integrated it into Document Converter for SharePoint On-Premises.

One of the workarounds we’ve recommended to our customers over the years is to create an MS SharePoint Designer workflow using our workflow actions and invoke that from Nintex Workflow for Microsoft 365. However, this doesn’t leverage the full power of Nintex Workflow for Microsoft 365 and Muhimbi Document Converter for SharePoint Online.

A better way to leverage the full power of both products is to integrate the functionality exposed by Document Converter for SharePoint Online directly into a Nintex workflow by invoking our comprehensive REST API.

When we first released Muhimbi Document Converter for SharePoint Online in early 2015, we were very much aware that, due to technical limitations in MS SharePoint Online, it wouldn’t be possible to bring the full power of our existing on-premises (SP2007–SE) products to the cloud. The first release focused on the most important elements, but one of the key features of our on-premises products was missing: an API to allow integration with third-party solutions and software partners.

Although we have a comprehensive Web Services (SOAP) interface exposed by our on-premises products, it’s less suitable for use by online subscription-based services. So, we now have a simplified REST-based interface, as that’s how modern systems, especially cloud-based products, talk to each other. This new REST-based service was launched as part of the Document Converter Services product. This is a separate product that has no dependencies on MS SharePoint and can be called from services such as Microsoft Power Automate, Azure Logic Apps, and Nintex Workflow for Microsoft 365, as well as from code written in C#, Java, PHP, JavaScript, Python, and Ruby. Although it’s available as a standalone subscription, this new service is automatically included in each Document Converter for SharePoint Online subscription at no additional charge.

Using Nintex Workflow to Convert Scans and Images to PDF

These next sections will walk you through how to use Nintex Workflow to convert scans and images to PDF documents.

Prerequisites

Ensure the following prerequisites are in place:

  • Muhimbi Document Converter for SharePoint Online installed and enabled in your workflow’s site with a full or trial subscription.

  • A Document Converter Services full or trial subscription and an associated API key (start trial).

  • Nintex Workflow for Microsoft 365 installed and enabled in your workflow’s site.

  • The appropriate privileges to deploy these apps, as well as author workflows.

  • Working knowledge of Nintex Workflow for Microsoft 365.

Note that this article is for the MS SharePoint Online version of Nintex Workflow.

Building the Workflow

It’s strongly recommended to follow the tutorial below, but the workflow is available for download as well. To use the download, import the file in Nintex Workflow for Microsoft 365, set the API key, and publish it, and you’ll be ready to go.

  1. Navigate to a site collection and document library of your choice, and choose the option to create a new Nintex workflow. This example uses the standard document library that’s available on most site collections.

  2. Create the following workflow variables, which will be needed later:

  • JSON (Text) — Contains the JavaScript Object Notation (JSON), which is the command that will be sent to the conversion service.

  • API_KEY (Text) — A unique ID that will be used to look up your Muhimbi subscription details.

  • ResponseText (Text) — The status message returned by the conversion service.

  • ResponseCode (Integer) — The status code returned by the conversion service.


  3. Insert a Set Workflow Status action, edit it, and set it to Started. As MS SharePoint Online doesn’t show a separate status, adding this action confirms that the workflow has actually been triggered, and it also gives you something to click on to inspect the current status of the workflow.


  4. Add a Build String action and set the output to the JSON workflow variable. In the String field, enter the following:

[
  "sharepoint_file":
  [
    "site_url":"{Workflow Context:Current site URL}",
    "source_file_url":"{Current Item:Server Relative URL}",
    "destination_file_url":"{Current Item:Server Relative URL}.pdf"
  ],
  "output_format":"PDF",
  "language": "English",
  "performance": "Slow but accurate", 
  "fail_on_error":true
]

Pay attention to the following:

  • JSON Notation — We’ve replaced the curly braces — { } — with square brackets [ ] due to a bug in Nintex Workflow for Microsoft 365. If you have any concerns using square brackets, as they’re also used for array types, you can replace them with anything else, as we’ll fix this in a follow-up step.

  • Copy & Paste — When copying and pasting the JSON code, ensure you paste it in Notepad and copy it back. This strips out non-standard characters and prevents formatting from being copied.

  • References — The text displayed in red is that of Nintex Workflow references. After copying and pasting the code fragment, replace each Nintex reference using the Advanced Lookup facility located below the field.

  • Output file name — In this basic example, add .pdf to the end of the output path and file name. This isn’t particularly pretty, but to keep things simple, we aren’t including the Nintex Workflow actions to strip off the old extension and add the new one. You can use whatever you like here as long as it’s a valid output path and file name.
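The extension handling mentioned under Output file name can be sketched outside Nintex. The Python snippet below is illustrative only: the function name pdf_destination is hypothetical and the sample paths are made up. It shows the string logic that a set of Nintex string actions would need to reproduce in order to swap the old extension for .pdf.

```python
import posixpath


def pdf_destination(server_relative_url: str) -> str:
    """Return the source path with its final extension replaced by .pdf.

    Hypothetical helper for illustration; it is not part of any Muhimbi
    or Nintex API.
    """
    # splitext only looks at the last path segment, so dots in folder
    # names (e.g. "v1.2") are left untouched.
    root, _old_extension = posixpath.splitext(server_relative_url)
    return root + ".pdf"


# Sample server-relative URLs (made up for this sketch).
print(pdf_destination("/Shared Documents/scan-001.tif"))
print(pdf_destination("/Shared Documents/v1.2/report"))
```

Note that for a source file that already ends in .pdf this logic returns the source path itself, so a production workflow would probably want to pick a different destination name in that case.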

  5. An earlier step used square brackets in JSON, so we need to replace them with curly braces again. Do this by using the Replace Substring in String action and configuring it as follows:

  • Search String — Enter the opening square bracket [.

  • Replace String — Enter the opening curly brace {.

  • String — Insert a reference to the workflow variable named JSON.

  • Output — Pick the JSON workflow variable to store the results in.

Click Save.

  6. Copy the workflow action using the action’s menu and paste it as the next action. Configure the newly pasted action to search for the closing square bracket ] and replace it with the closing curly brace }.

Click Save to save the action. You now have valid JSON that you can send to the conversion service.
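To see what the two Replace Substring in String actions achieve, the same substitution can be tried in a few lines of Python. This is a sketch for checking the template offline, not part of the workflow: the site URL and file paths are made-up stand-ins for the Nintex references, while the field names are those used in the template above.

```python
import json

# Template as typed into the Build String action, with sample values
# standing in for the Nintex references (hypothetical site and file).
template = """[
  "sharepoint_file":
  [
    "site_url":"https://contoso.sharepoint.com/sites/demo",
    "source_file_url":"/sites/demo/Shared Documents/scan.tif",
    "destination_file_url":"/sites/demo/Shared Documents/scan.tif.pdf"
  ],
  "output_format":"PDF",
  "language":"English",
  "performance":"Slow but accurate",
  "fail_on_error":true
]"""

# Equivalent of the two replace actions: [ becomes {, ] becomes }.
payload_text = template.replace("[", "{").replace("]", "}")

# json.loads raises a ValueError if the result is not valid JSON.
payload = json.loads(payload_text)
print(payload["sharepoint_file"]["destination_file_url"])
```

Because the replacement is global, any literal square bracket in a file name, or a genuine JSON array, would be rewritten as well. If that is a concern, use a different pair of placeholder characters in the template and search for those instead.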

  7. Next, set the API key. Insert a Set Workflow Variable action and configure it to set the API_KEY workflow variable to the API key you received by email when signing up for Muhimbi Document Converter Services Online, e.g.:

decafbad-baad-baad-baad-decafbaaaaad

Don’t try to use this particular key, as it won’t work. Ensure you don’t put curly braces around the key. Click Save to save the action.

  8. Next, insert a Web Request action and configure it as follows:

URL — https://api.muhimbi.com/api/v1/operations/ocr_pdf

Method — POST

Content type — application/json

Add header — Click Add header, specify the API_KEY as the header name, and insert a reference to the API_KEY workflow variable for the header value.

Body — Select the content option and add a reference to the JSON workflow variable in the data field.

Store response content in — ResponseText.

Click Save to save the action.
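For testing outside Nintex, the Web Request settings above map onto a plain HTTP POST. The Python sketch below uses only the standard library and takes the URL, header name, and content type from the settings listed above. The function name build_ocr_request and the sample payload values are assumptions made for illustration, and the key is the non-working placeholder from this article. The request is built but not sent, because sending it requires a valid API key.

```python
import json
import urllib.request

OCR_URL = "https://api.muhimbi.com/api/v1/operations/ocr_pdf"


def build_ocr_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request described above."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        OCR_URL,
        data=body,
        method="POST",
        # urllib normalises header-name casing internally; HTTP header
        # names are case-insensitive.
        headers={"API_KEY": api_key, "Content-Type": "application/json"},
    )


# Sample payload with made-up site and file values.
payload = {
    "sharepoint_file": {
        "site_url": "https://contoso.sharepoint.com/sites/demo",
        "source_file_url": "/sites/demo/Shared Documents/scan.tif",
        "destination_file_url": "/sites/demo/Shared Documents/scan.tif.pdf",
    },
    "output_format": "PDF",
    "language": "English",
    "performance": "Slow but accurate",
    "fail_on_error": True,
}

# Placeholder key from the article; it will not authenticate.
request = build_ocr_request("decafbad-baad-baad-baad-decafbaaaaad", payload)
print(request.get_method(), request.full_url)

# To actually send the request (needs a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     response_text = response.read().decode("utf-8")
```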

  9. Finally, insert another Set Workflow Status action and configure it with the text Completed. This indicates when the workflow instance has completed its run. The finished workflow should resemble the following:
    [Figure: The completed workflow, parts 1 and 2]

  10. Save and publish the workflow by giving it a suitable name, and set the Start Options to a value of your choice.

  11. Once published, open the document library the workflow is associated with, make sure a file of a supported type is present, and manually start the workflow. After a few seconds, the PDF file will show up next to the file the workflow was started on.

Troubleshooting

Although Nintex Workflow for Microsoft 365 and Muhimbi Document Converter work well together, there are a lot of moving parts in the workflow, such as custom-generated JSON, customer-specific API keys, and paths to document libraries, so you may run into issues when deploying it. Some common issues and troubleshooting tips are listed below:

  • Check prerequisites — Double-check that the prerequisites listed at the beginning of this section are in place.

  • Log to History List — If it isn’t clear what’s going wrong, log critical parts, such as the JSON workflow variable (after the replace operation) and the ResponseText workflow variable (after the web request) using the Log To History List workflow action. You can see the contents of this list by clicking the Workflow Status column for the List Item the workflow is running on.

  • Send email — The amount of text that can be logged to the History List is limited (roughly 250 characters). For larger messages, use the Send an Email action to send yourself an email with the debugging content in the body.

  • Copy & Paste — When copying the JSON fragment into your workflow, paste it into Notepad first to clean it, and then copy it from Notepad and paste it into your workflow. This is because browsers tend to insert hidden characters that aren’t filtered out by the Nintex Workflow editor.

  • Nintex References — Make sure the Nintex Workflow references in the provided JSON are replaced by actual Nintex Workflow references. You can double-check that the references are active by logging the JSON workflow variable to the History List. You should see the actual paths and not {Current Item:Server Relative URL}.

  • Muhimbi Support — If you’re still stuck after double-checking all prerequisites and going over all troubleshooting steps in this section, contact our friendly support desk.
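Two of the checks above are easy to reproduce on a copy of the logged JSON. The Python sketch below is a rough aid rather than a definitive tool: both function names are hypothetical, the reference pattern is an assumption based only on the two reference names shown in this article, and the 250-character limit is the approximate figure quoted above.

```python
import re
from typing import List

# Matches literal reference tokens such as {Current Item:Server Relative URL}.
# Assumption: a reference is "{words:words}" with letters and spaces only.
UNRESOLVED_REFERENCE = re.compile(r"\{[A-Za-z ]+:[A-Za-z ]+\}")


def unresolved_references(json_text: str) -> List[str]:
    """Return reference tokens that were pasted as text, not resolved."""
    return UNRESOLVED_REFERENCE.findall(json_text)


def history_chunks(message: str, limit: int = 250) -> List[str]:
    """Split a long message into pieces short enough to log one by one."""
    return [message[i:i + limit] for i in range(0, len(message), limit)]


# A logged JSON value in which one reference was never replaced.
logged = '{"sharepoint_file":{"source_file_url":"{Current Item:Server Relative URL}"}}'
print(unresolved_references(logged))
print([len(chunk) for chunk in history_chunks("x" * 600)])
```

An empty list from unresolved_references suggests the references were replaced by real lookups. It does not prove that the resulting paths are correct.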

Fine-tuning

The workflow created in the previous section was intended to give a quick idea of how to use the converter.

We have a version of the conversion workflow that is more production ready. Full details on this are beyond the scope of this article. You can download the full workflow here and customize it according to your requirements.

After customization, you can import it into Nintex Workflow for Microsoft 365, set the API key, and then publish it for your use.


Other Operations

This guide demonstrated how to invoke the OCR operation (ocr_pdf) on Muhimbi’s REST interface. Full examples of the other operations are beyond the scope of this article, but you can find examples in the SharePoint section of our GitHub repository.
