This blog post illustrates how to process image files (non-searchable PDFs and TIFFs) within their own folder location (in-place processing) while retaining the same filename.
The scenario: Joe has 150GB of data on his G drive, where users upload documents on a daily basis. The drive primarily holds image (non-searchable) files of type PDF and TIFF, but it also has a number of Excel spreadsheets and Word documents. He would like to convert the PDF and TIFF files to searchable PDFs, leaving the Excel and Word documents as they are. Once he has worked through the backlog of data, he would like to process only the new PDF and TIFF files that users upload.
The hardware specification Joe has for Autobahn is:
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
This dedicated machine has 8 CPUs and 4GB of RAM.
On a machine of this specification, a job in Autobahn can typically be configured in one of two ways, depending on the number of pages in the source PDF and TIFF files:
- One-to-Many (one job with many (4*) threads)
- Many-to-One (many jobs with 1 thread each)
One-to-Many:
The One-to-Many name comes from having one job configured with many threads. This approach works best if the majority of the source image files have around 30 pages or more.
A job configured in this fashion splits the source image file (PDF/TIFF) into 4 sections when the job starts and passes one section to each thread. OCR is applied to each section, and once all of the threads have completed, the PDF/TIFF is put back together.
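To make the splitting step concrete, here is a minimal Python sketch of how a page count could be divided into contiguous sections, one per thread. It is only an illustration: the function name split_pages and the even-split rule are assumptions, and this is not how Autobahn itself is implemented.

```python
def split_pages(page_count, threads=4):
    """Return 1-based, inclusive (first_page, last_page) ranges, one per thread.

    Hypothetical helper: assumes pages are divided as evenly as possible,
    with any remainder going to the earliest sections.
    """
    base, extra = divmod(page_count, threads)
    sections = []
    start = 1
    for i in range(threads):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer pages than threads: skip empty sections
        sections.append((start, start + size - 1))
        start += size
    return sections


if __name__ == "__main__":
    # A 30-page document shared between 4 threads
    print(split_pages(30))
```

With very short documents most sections end up holding only one or two pages, which is one way to see why this approach suits files of around 30 pages or more.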
Many-to-One:
The Many-to-One approach is when you have four jobs configured in Autobahn with a single thread assigned to each job. In Joe’s scenario we would split the source folder into 4 subfolders and point each job at one subfolder. This approach works best when the source files have fewer than 30 pages.
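As a rough illustration of dividing the workload between four jobs, the Python sketch below assigns filenames to four groups in round-robin order. The group names (job1 to job4) and the round-robin rule are assumptions made for the example; in practice the subfolders may already follow an existing folder structure.

```python
def assign_to_subfolders(filenames, jobs=4):
    """Distribute filenames over `jobs` groups in round-robin order.

    The keys "job1".."jobN" are hypothetical subfolder names, one per
    single-threaded job.
    """
    buckets = {f"job{i + 1}": [] for i in range(jobs)}
    for index, name in enumerate(sorted(filenames)):
        buckets[f"job{index % jobs + 1}"].append(name)
    return buckets


if __name__ == "__main__":
    files = ["a.pdf", "b.tif", "c.pdf", "d.pdf", "e.tif"]
    for subfolder, names in assign_to_subfolders(files).items():
        print(subfolder, names)
```

Round-robin keeps the file counts per job within one of each other, although it does not balance by page count or file size.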
In both of the above cases, because of the number of source files that need to be OCR’d, Autobahn should be set to a count limit, which puts each job into ‘batch process’ mode. The following demonstrates the use of this mechanism:
The above filter expression, “Include with Document Count Limit” with the value “*PDF, *TIF, ; 1000”, dictates that only 1000 PDF/TIFF files will be processed by Autobahn each time the job is run. You can then set the job to run every 2 hours.
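The effect of the filter value can be sketched in Python as follows. This is an interpretation of the value's layout (wildcard patterns separated by commas, then a semicolon, then the limit), not Autobahn's actual parser; the function names parse_filter and select_batch are made up for the example, and matching is assumed to be case-insensitive.

```python
from fnmatch import fnmatchcase


def parse_filter(value):
    """Split a value such as "*PDF, *TIF, ; 1000" into (patterns, limit)."""
    patterns_part, _, limit_part = value.partition(";")
    patterns = [p.strip() for p in patterns_part.split(",") if p.strip()]
    limit = int(limit_part) if limit_part.strip() else None
    return patterns, limit


def select_batch(filenames, value):
    """Return the matching filenames, stopping once the limit is reached."""
    patterns, limit = parse_filter(value)
    batch = []
    for name in filenames:
        # Assumption: patterns are compared without regard to case
        if any(fnmatchcase(name.upper(), p.upper()) for p in patterns):
            batch.append(name)
            if limit is not None and len(batch) == limit:
                break
    return batch


if __name__ == "__main__":
    names = ["a.pdf", "b.xlsx", "c.tif", "d.docx", "e.PDF"]
    print(select_batch(names, "*PDF, *TIF, ; 2"))
```

Excel and Word files never match the patterns, so they are left alone, and each run picks up no more than the stated number of image files.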
How you configure the jobs depends on the size of the PDFs/TIFFs and the hardware specification. Setting the count limit, however, is crucial on huge repositories: if it is not set, the application will crawl through the whole file structure to try to build a list of the files that need to be moved for OCR.
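The difference a count limit makes on a large file structure can be illustrated with a short Python sketch. It walks a folder tree lazily and stops as soon as the batch is full, instead of listing every file first. This is a general technique shown under assumptions, not a description of Autobahn's internals; the function names and the extension list are hypothetical.

```python
import os
from itertools import islice


def iter_image_files(root, extensions=(".pdf", ".tif")):
    """Lazily yield paths under `root` whose names end with one of `extensions`."""
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name.lower().endswith(extensions):
                yield os.path.join(folder, name)


def next_batch(root, limit=1000):
    """Collect at most `limit` image files, then stop walking the tree."""
    return list(islice(iter_image_files(root), limit))
```

Calling next_batch on the root of the drive returns after the first 1000 matches, so the cost of each run depends on the batch size rather than on the size of the whole repository.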