How to OCR PDFs using SharePoint Designer Workflows
Although it had been years in the planning, we didn’t really make a big deal out of the support for Optical Character Recognition (OCR) when we shipped it as part of version 7.1 of the PDF Converter for SharePoint. We had a good reason for this: although the underpinnings were working well, the actual integration points with SharePoint, specifically SharePoint Designer Workflows, weren’t as nice as we wanted them to be.
With the release of version 7.2 we are adding two new Workflow Activities to both Nintex Workflow and SharePoint Designer. The first activity, described in this post, can be used to convert scanned content into fully searchable PDFs. A separate post will detail the other new OCR Activity, which can extract text from scanned content. For a high level overview of our OCR facilities please read the original announcement.
This post describes the SharePoint Designer Workflow Activity. The Nintex Workflow equivalent can be found here.
Optical Character Recognition… sounds quite complex, what would you need that for? Well, most organisations deal with scanned (or other bitmap based) content on a regular basis. Faxes are received in a digital inbox, invoices or legal documents are scanned and filed away in a file system / SharePoint library or other Document Management System. The problem is that this is ‘dead information’ that cannot be searched or indexed using traditional technology. Content is stored as one big image which cannot be indexed by search crawlers and, as a result, does not show up in search results.
This is where OCR comes in. OCR analyses image based content – e.g. a scanned PDF or an image embedded in an MS-Word file – applies some fancy recognition logic and then embeds the result in a PDF. The scanned content still looks the same, but you can now copy text from the document and search crawlers are clever enough to index this text as well. Confused? Have a look at the screenshot below.
Scanned Document with OCRed text selected
It is possible to carry out OCR using our standard Convert Document workflow activity, but that requires knowledge of our XML syntax, which - although powerful - is less than user friendly. To make life easier we have created a separate Workflow Activity named Convert to OCRed PDF. This is what it looks like.
In typical Muhimbi fashion the workflow sentence is consistent with our other Workflow Activities (e.g. Converting / Watermarking / Merging / Securing) and largely self-describing.
- this document: The source document to convert and OCR. For most workflows selecting Current Item will suffice, but some scenarios may require looking up a different item.
- this file: The name and location to write the generated file to. Leave this field empty to use the same location and name as the source file. Please note that if your source file is already in PDF format then leaving this field empty will overwrite it. For details about how to specify paths to different libraries / site collections see this blog post.
- include / exclude meta data: Controls whether the source file’s SharePoint meta-data is copied to the destination file.
- OCR language: The language the source document is written in. It defaults to English, but we currently (version 7.2) support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
- OCR Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
- Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
- Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
- Regions: By default the entire page is OCRed. To limit OCR to certain parts of a page, e.g. a header and/or footer, you can specify one or more regions using our XML syntax. Have a look at this blog post, but only use the part that starts with (and includes) … .
- List ID: The ID of the list the processed file was written to. This can be used later in the workflow to perform additional tasks on the file, such as a check-in or check-out.
- Item ID: The ID of the processed file. Can be used together with the List ID.
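Two of the parameters above are easier to reason about with a small illustration. The Python sketch below is not part of the PDF Converter and does not reflect how the OCR engine is implemented; it only models the documented behaviour. The function names (`resolve_output_path`, `will_overwrite_source`, `pick_character`) are hypothetical, as are the example confidence scores. It also assumes that an empty "this file" field yields the source name with a `.pdf` extension.

```python
from pathlib import PurePosixPath


def resolve_output_path(source_path: str, target: str = "") -> str:
    """Model the 'this file' field: an empty target means same location and name.

    Assumption: the generated file keeps the source name but gets a .pdf extension.
    """
    if target.strip():
        return target
    return str(PurePosixPath(source_path).with_suffix(".pdf"))


def will_overwrite_source(source_path: str, target: str = "") -> bool:
    """True when the generated file would land on top of the source file."""
    return resolve_output_path(source_path, target).lower() == source_path.lower()


def pick_character(candidates, whitelist: str = "", blacklist: str = ""):
    """Pick the most likely character that the whitelist / blacklist allows.

    candidates: list of (character, confidence) pairs from a hypothetical
    recogniser. An empty whitelist means every character is allowed.
    """
    allowed = [
        (char, confidence)
        for char, confidence in candidates
        if (not whitelist or char in whitelist) and char not in blacklist
    ]
    if not allowed:
        return None
    return max(allowed, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    # A scanned TIFF gets a new PDF next to it; a PDF source would be overwritten.
    print(resolve_output_path("Shared Documents/scan.tif"))
    print(will_overwrite_source("Shared Documents/invoice.pdf"))

    # An ambiguous glyph: the made-up recogniser slightly prefers the letter O.
    glyph = [("O", 0.55), ("0", 0.45)]
    print(pick_character(glyph))
    print(pick_character(glyph, whitelist="1234567890"))
```

With a digits-only whitelist the ambiguous glyph resolves to a zero, which is the effect described for the Whitelist / Blacklist parameter. The overwrite check shows why it is worth filling in "this file" explicitly when the source is already a PDF.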
Although creating simple workflows in SharePoint Designer is relatively easy, there is a first time for everything. If the concept of SharePoint Designer workflows is new to you then have a look at our simple Getting Started Knowledge Base article.
Please note that the OCR and PDF/A Archiving add-on license is needed in order to use OCR in your production environment.
Any questions or comments? Leave a message below or contact us.
Clavin is a Microsoft Business Applications MVP who supports 1,000+ high-level enterprise customers with challenges related to PDF conversion in combination with SharePoint on-premises, Office 365, Azure, Nintex, K2, and Power Platform, mostly with no-code solutions.