Ensure PDF searchability with OCR tagging
To extract metadata from PDF documents using Entity Extraction, Taxonomy Matching, and Zonal Text Extraction, the documents must already be text searchable. If they are image-only, these extraction tasks fail because there is no text to extract and process.
To overcome this issue, you can use Document Searchability Tagging in conjunction with Document Searchability OCR to ensure that PDF documents are fully text searchable before Tagging attempts to process them.
For this to work:
- Both Document Searchability Tagging and Document Searchability OCR must be installed on the same machine.
- In Document Searchability OCR, create a library that points to the site collection, site, or library that you are processing in Tagging, and schedule it to run before Tagging.
- In Tagging, go to Job > Document Settings and set Require Searchlight OCR to ‘Yes’.
- Schedule the Tagging job to run after Document Searchability OCR.
Tagging automatically identifies where Document Searchability OCR is installed and queries its database to see which documents Document Searchability OCR has processed. If Tagging encounters a document that has not been processed by Document Searchability OCR, it skips the document and writes a warning message to the log file.
Tagging will keep skipping the document until it is processed by Document Searchability OCR.
If Document Searchability OCR is not installed, Tagging will not work unless you set Require Searchlight OCR (see step 3 above) to ‘No’.
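The behavior described above can be summarized as a small decision function. This is only an illustrative sketch in Python, not the product's code: the names `Decision`, `decide`, and `ocr_processed` are hypothetical, and the set of processed documents stands in for the lookup that Tagging performs against the Document Searchability OCR database.

```python
from enum import Enum
from typing import Optional, Set


class Decision(Enum):
    PROCESS = "process"  # Tagging goes ahead with the document
    SKIP = "skip"        # not yet processed by OCR: log a warning, retry on a later run
    FAIL = "fail"        # OCR is required but is not installed


def decide(doc_url: str, require_ocr: bool,
           ocr_processed: Optional[Set[str]]) -> Decision:
    """Model of the checks described above (hypothetical, for illustration).

    ocr_processed is the set of documents that OCR has already handled,
    or None when Document Searchability OCR is not installed.
    """
    if not require_ocr:
        # Require Searchlight OCR set to 'No': no OCR check is made.
        return Decision.PROCESS
    if ocr_processed is None:
        # Require Searchlight OCR set to 'Yes' but OCR is not installed.
        return Decision.FAIL
    if doc_url in ocr_processed:
        return Decision.PROCESS
    # The document is skipped on every run until OCR has processed it.
    return Decision.SKIP
```

For example, `decide("b.pdf", True, {"a.pdf"})` returns `Decision.SKIP`, which mirrors a document that Tagging keeps skipping until Document Searchability OCR has processed it.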