Starting with version 7.1 of the Muhimbi PDF Converter Services and the Muhimbi PDF Converter for SharePoint, we have added support for OCR (Optical Character Recognition). This has been on our roadmap for years, and has been requested by many customers, but developing such advanced functionality takes time.
Please read on for details about the scenarios that benefit from Optical Character Recognition, the OCR facilities provided by our range of server based PDF Conversion products, and some sample files. You may also want to have a look at The How and Why of OCR.
OCR Use Case 1 – Making scanned documents searchable
One of the more popular questions our support desk receives is whether converted PDF files are searchable by users and indexable by search engines. The answer to that question has always been "Yes", as long as the source document consists of real text, as is the case for MS-Word, Excel, MSG, EML, HTML and most of the other file formats we support.
The story is quite different when the source file is a scanned document, which generally just contains a picture of the text. Search engines usually do not understand these image based files and will simply skip them. The solution is to OCR these documents, a process that recognises text and places it in a hidden layer. The resulting document still looks identical to the original file, but search engines and PDF readers are intelligent enough to retrieve the text. As a result, scanned documents are fully searchable and content can even be copied to the clipboard for pasting into other applications.
OCR Use Case 2 – Extracting text from a page region
Another common use for OCR is extracting text from an area on a page. Let’s take an order processing system as an example. Orders arrive via various channels, but they always use a predetermined template. Each order is passed through a scanner and placed in a file repository such as SharePoint. Although the computer happily stores the scanned file, it cannot interpret its content, so it is up to a human to enter meta-data such as the Customer ID and Order Number. Provided this information is always stored in (roughly) the same area, an OCR based solution can extract the details automatically and without human intervention.
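To make the idea of reading a fixed page region more concrete, here is a minimal Python sketch. It is not the PDF Converter's API. It assumes, purely for illustration, that an OCR engine has already returned each recognised word together with its bounding box in page pixels. The Word and Region classes, the text_in_region function and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognised word with its bounding box in page pixels (hypothetical format)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


@dataclass(frozen=True)
class Region:
    """Rectangular area of the page to read, in the same pixel coordinates."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, word: Word) -> bool:
        # Test the centre of the word so that small scanning offsets do not matter.
        cx = word.x + word.width / 2
        cy = word.y + word.height / 2
        return (self.x <= cx <= self.x + self.width
                and self.y <= cy <= self.y + self.height)


def text_in_region(words: list[Word], region: Region) -> str:
    """Join the words whose centre lies inside the region, roughly in reading order."""
    hits = [w for w in words if region.contains(w)]
    hits.sort(key=lambda w: (w.y, w.x))  # top to bottom, then left to right
    return " ".join(w.text for w in hits)


# Made-up OCR output for a scanned order form that follows a fixed template.
words = [
    Word("Customer", 100, 50, 120, 20),
    Word("ID:", 230, 50, 40, 20),
    Word("C-10442", 300, 52, 110, 20),
    Word("Order", 100, 90, 80, 20),
    Word("Number:", 190, 90, 100, 20),
    Word("784512", 300, 91, 90, 20),
]

# The template always places the values in these (assumed) areas.
customer_id_region = Region(x=290, y=40, width=200, height=40)
order_number_region = Region(x=290, y=80, width=200, height=40)

print(text_in_region(words, customer_id_region))   # C-10442
print(text_in_region(words, order_number_region))  # 784512
```

Testing the centre of each word rather than its full bounding box gives some tolerance for pages that shift slightly in the scanner, which is why the information only needs to be in roughly the same area.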
The use case for making scanned documents searchable is supported by version 7.1 of the PDF Converter. Extracting text is planned for version 7.2.
The key features of the current release are:
- Server based solution, accessible via a modern Web Service interface (Java, C#, Ruby, PHP etc.).
- Integrates with SharePoint Designer and Nintex workflows.
- Converts image based files such as TIFF, scanned PDF, PNG, JPG, BMP and GIF to searchable PDFs.
- Support for multiple languages (Danish, German, English, Dutch, Finnish, French, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish, with more to come).
- Additional languages and custom fonts can be added by customers and third parties.
- OCR is fully integrated with the conversion pipeline, allowing a single web service call to Convert, OCR, Watermark, Merge and Secure documents.
- Whitelist / Blacklist certain characters. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
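The effect of a character whitelist can be illustrated with a small Python sketch. This is not how the PDF Converter implements the feature internally. It assumes a simplified model in which the recogniser proposes several candidate characters per glyph, each with a confidence score. The pick_character and recognise functions and the scores are invented for the example.

```python
from typing import Dict, List, Optional


def pick_character(candidates: Dict[str, float],
                   whitelist: Optional[str] = None,
                   blacklist: str = "") -> Optional[str]:
    """Return the highest scoring candidate that the whitelist/blacklist allows."""
    allowed = {
        ch: score for ch, score in candidates.items()
        if (whitelist is None or ch in whitelist) and ch not in blacklist
    }
    if not allowed:
        return None  # nothing acceptable was proposed for this glyph
    return max(allowed, key=allowed.get)


def recognise(glyphs: List[Dict[str, float]],
              whitelist: Optional[str] = None,
              blacklist: str = "") -> str:
    """Recognise a sequence of glyphs, using '?' where no candidate is allowed."""
    return "".join(pick_character(g, whitelist, blacklist) or "?" for g in glyphs)


# Invented candidate scores for a scanned order number that reads "1007".
glyphs = [
    {"1": 0.97, "l": 0.55, "I": 0.40},
    {"O": 0.81, "0": 0.78, "o": 0.60},  # the letter O narrowly beats zero
    {"0": 0.85, "O": 0.80},
    {"7": 0.93, "T": 0.20},
]

print(recognise(glyphs))                          # 1O07
print(recognise(glyphs, whitelist="1234567890"))  # 1007
```

Without a whitelist, the second glyph is read as the letter O because it scores marginally higher. Restricting recognition to digits removes that candidate, so the zero wins.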
Scanned Document with OCRed text selected
Please keep in mind that OCR is not some kind of white (or black) magic. If the source material is of poor quality (a lot of noise, scratches, low resolution, funny fonts) then don’t expect your text to be recognised with a high level of accuracy. However, when the scans use at least 300dpi and the font size is not smaller than 10pt, then you should see good results.
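Whether a scan reaches the 300dpi guideline can be estimated from its pixel dimensions and the physical size of the page that was scanned. The Python sketch below shows the arithmetic. The function names are our own, and the page size (US Letter, 8.5 x 11 inches) is just an example.

```python
MIN_DPI = 300  # guideline from the text above


def effective_dpi(pixels: int, inches: float) -> float:
    """Resolution along one axis: pixels divided by the physical length in inches."""
    return pixels / inches


def meets_minimum(width_px: int, height_px: int,
                  width_in: float, height_in: float,
                  min_dpi: float = MIN_DPI) -> bool:
    """True when both axes reach the minimum resolution."""
    lowest = min(effective_dpi(width_px, width_in),
                 effective_dpi(height_px, height_in))
    return lowest >= min_dpi


def point_size_in_pixels(points: float, dpi: float) -> float:
    """Height in pixels of text of the given point size (1 point = 1/72 inch)."""
    return points / 72 * dpi


# A US Letter page (8.5 x 11 inches) scanned at two different resolutions.
print(meets_minimum(2550, 3300, 8.5, 11.0))  # True  (300 dpi)
print(meets_minimum(1275, 1650, 8.5, 11.0))  # False (150 dpi)

# 10pt text at 300 dpi is roughly 42 pixels tall.
print(round(point_size_in_pixels(10, 300), 1))  # 41.7
```

The last figure gives a feel for why small fonts and low resolutions hurt accuracy: at 150dpi the same 10pt text is only about 21 pixels tall, which leaves far less detail per character.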
The current release is not perfect. We will continue to improve the OCR facilities over the next few years, in the same way that we keep evolving the rest of the product.
The main limitations are currently as follows:
- The JPXDecode (JPEG2000) image encoding type is currently not supported. As a workaround, use our software to convert the JPEG2000 encoded PDF to a PDF version that uses a different encoding (e.g. PDF 1.4).
- Performance is not yet as quick as we would like it to be. Note that OCR performance is measured in seconds per page, not milliseconds per page like most of the other operations carried out by our software. We are looking to improve performance significantly, though. Multiple OCR tasks can already execute in parallel to make use of multicore systems.
- The system cannot be used to recognise human handwriting.
Sample files
As they say, a picture is worth a thousand words, so example files are provided below. Both the original scanned image and a PDF that OCR has been carried out on can be downloaded. Open the OCRed PDF and try searching for a word.
Download Original, unprocessed, PDF
This test document was created by printing our license agreement on a standard inkjet printer and then scanning it as a 400dpi monochrome image using JBIG2 (lossless) compression. As you can see, text is recognised very well. We left some imperfections in on purpose as we don’t want to provide a doctored example that somehow claims 100% accuracy. No product can do that and neither can ours. This example was prepared using a beta version of the software.
Please note that you need the OCR and PDF/A Archiving add-on license in addition to a valid PDF Converter for SharePoint or PDF Converter Services License in order to use this functionality.
Any questions or remarks? Leave a message in the comments below or contact us.