With the release of version 7.1 of PDF Converter for SharePoint, we added a fundamental new technology to our document conversion and manipulation platform: optical character recognition (OCR). The initial release was able to process scanned and bitmap-based content and generate fully searchable PDFs.
With the introduction of version 7.2, we’re adding support for a new OCR-related use case, which is the ability to recognize text on part of a page and return the actual text — not a bitmap — to the workflow for additional processing. A common use for this functionality is to extract a particular area of text from documents that all use a common template or layout. For example, if a reference number can always be found at the top-right corner of scanned documents, then that text can be extracted and stored in a SharePoint column, from where it can then be included in searches or used in more workflow steps. Pretty powerful stuff.
This post describes the SharePoint Designer workflow activity. The Nintex workflow equivalent can be found here.
For more details, including an introduction, refer to the following blog posts:
Once the Muhimbi PDF Converter for SharePoint is installed, you’ll find a number of new workflow activities in SharePoint Designer. One of these activities is named Extract Text using OCR and looks as follows.
The workflow sentence is consistent with our other workflow activities — e.g. converting, watermarking, merging, and securing — and largely self-describing.
- this document: The source document to OCR and extract text from. For most workflows, selecting Current Item will suffice, but some scenarios may require the lookup of a different item.
- OCR language: The language the source document is written in. It defaults to English. As of version 7.2, the supported languages are Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish, and Swedish.
- OCR engine: We now support multiple OCR engines, including the Muhimbi OCR engine and the new GdPicture OCR engine.
- OCR performance: Specify the trade-off between speed and accuracy of the OCR engine. It's recommended to leave this on the default Slow but accurate setting.
- Whitelist/blacklist: Control which characters are recognized. For example, limit recognition to digits by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognized as the letter o or O.
- Pagination: In some specific cases, a single image spans multiple pages. Enable pagination for those cases.
- Region: Specify the x, y, width, and height of the region to retrieve text from. The unit of measure (UOM) is 1/72nd of an inch. When extracting text from non-PDF files, e.g. a TIFF or PNG, keep in mind that the image is first converted to PDF internally. This may add margins around the image, but it guarantees that a single, unified UOM is used across all file formats. If you aren't sure how internal conversion affects the dimensions of your image or scan, use our software to convert the file to PDF and open it in a PDF reader.
- Page: By default, text is extracted from all pages and concatenated. To extract the text from a specific page, specify the page number in this field.
- Result: The recognized text will be stored in this variable (type String).
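Because the Region fields expect values in 1/72nd of an inch, measurements taken in pixels or millimeters need converting first. The following is a minimal Python sketch of that arithmetic, not part of the product: the `Region` class and the helper names are hypothetical, and the scan resolution (DPI) is an assumption you need to supply yourself. It only converts units. It doesn't account for any margins added during the internal conversion to PDF, and it makes no assumption about where the page origin lies.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72    # the Region unit is 1/72nd of an inch
MM_PER_INCH = 25.4


@dataclass(frozen=True)
class Region:
    """Hypothetical container for the four Region fields, in 1/72 inch."""
    x: float
    y: float
    width: float
    height: float


def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert a pixel measurement to 1/72 inch units for a given scan DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return pixels * POINTS_PER_INCH / dpi


def mm_to_points(millimeters: float) -> float:
    """Convert millimeters to 1/72 inch units."""
    return millimeters * POINTS_PER_INCH / MM_PER_INCH


def region_from_pixels(x_px: float, y_px: float,
                       width_px: float, height_px: float,
                       dpi: float) -> Region:
    """Build a Region from a box measured in pixels on the scanned image."""
    return Region(
        x=pixels_to_points(x_px, dpi),
        y=pixels_to_points(y_px, dpi),
        width=pixels_to_points(width_px, dpi),
        height=pixels_to_points(height_px, dpi),
    )


if __name__ == "__main__":
    # Assumed example: a box measured at pixel (1800, 100), 600 x 150 pixels,
    # on a scan made at 300 DPI.
    print(region_from_pixels(1800, 100, 600, 150, dpi=300))
    # Region(x=432.0, y=24.0, width=144.0, height=36.0)
```

Treat the resulting numbers as a starting point. As noted above, converting the file to PDF and checking the dimensions in a PDF reader is the reliable way to confirm where the region actually falls.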
Although creating simple workflows in SharePoint Designer is relatively easy, there’s a first time for everything. If the concept of SharePoint Designer workflows is new to you, refer to our Getting Started Knowledge Base article.
Note that the OCR and PDF/A Archiving add-on license is needed to use OCR in your production environment.
Any questions or comments? Leave a message below or contact us.
Clavin is a Microsoft Business Applications MVP who supports 1,000+ high-level enterprise customers with challenges related to PDF conversion in combination with SharePoint on-premises, Office 365, Azure, Nintex, K2, and Power Platform, mostly using no-code solutions.