Optimize your documents with extended OCR steps
The Document Automation Server (DAS) command-line interface (CLI) also supports the Extended OCR module.
To use it, pass /ocrengine=1 on the command line; this parameter is required.
autobahndx.exe /operation=[operation name] /source=[tiff file or folder] /output=[output file] /target=[target folder] [/option=value]…

Examples:

1. Generate a searchable PDF and a Word (DOCX) file from an image PDF file, keeping the original file name.

   autobahndx.exe /source="C:\ADX Demo\In\PDF\File\US2007246939A1.pdf" /sourcetype=file /target="C:\ADX Demo\Output" /output=%FILENAME /outputtype=pdf,docx /operation=ocrimagepdf /ocrengine=1

2. Generate a searchable PDF file from a folder of TIFF and JPEG files, with deskew and page orientation detection and correction.

   autobahndx.exe /source="C:\ADX Demo\In\TIFF\Folder" /sourcetype=folder /target="C:\ADX Demo\Output" /output=outfilef /outputtype=pdf /autorotate /deskew /operation=mergetifftopdf /ocrengine=1

3. Generate searchable PDF files from image PDF files found in a folder and its subfolders, while keeping the original file names.

   autobahndx.exe /source="C:\ADX Demo\In\PDF\Tree" /target="C:\ADX Demo\Output" /sourcetype=tree /output=%FILENAME /outputtype=pdf /operation=ocrimagepdf /ocrengine=1
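
When these commands are generated from a script, it can help to assemble them from a dictionary of options rather than by hand. The Python sketch below rebuilds the command line of example 3. The helper names `format_option` and `build_command` are hypothetical and not part of DAS. The sketch assumes that values containing spaces are quoted as in the examples above and that an option with no value (such as /deskew) is written as a bare switch.

```python
def format_option(name, value=None):
    """Format one option as /name or /name=value, quoting values with spaces."""
    if value is None:
        return f"/{name}"  # bare switch, e.g. /deskew
    text = str(value)
    if " " in text:
        text = f'"{text}"'  # same quoting style as the documented examples
    return f"/{name}={text}"


def build_command(options, exe="autobahndx.exe"):
    """Join the executable name and all formatted options into one command line."""
    parts = [exe] + [format_option(name, value) for name, value in options.items()]
    return " ".join(parts)


# Options taken from example 3 above, in the same order.
example3 = {
    "source": r"C:\ADX Demo\In\PDF\Tree",
    "target": r"C:\ADX Demo\Output",
    "sourcetype": "tree",
    "output": "%FILENAME",
    "outputtype": "pdf",
    "operation": "ocrimagepdf",
    "ocrengine": 1,
}

command = build_command(example3)
print(command)

# On a Windows machine with DAS installed, the string could then be run with:
#   import subprocess
#   subprocess.run(command, check=True)
```

Keeping the options in a dictionary makes it easy to reuse one base configuration and change only the source or operation per job.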
The Extended OCR steps use the parameters listed in the table below.
Parameter | Notes |
---|---|
/operation | The operation to carry out. One of: tifftopdf, mergetifftopdf, ocrimagepdf, ocranyfileex |
/ocrengine | The OCR engine to use. This must be set to 1 to use the IRIS engine. /ocrengine=1 |
/source | Source file or folder |
/sourcetype | File (default), Folder, or Tree (a folder and its subfolders, as in example 3 above) |
/target | The Target folder |
/output | The output filename excluding the extension (which will be added according to the output file type) |
/outputtype | One or more of the following, separated by commas if more than one is required: RTF, PDF, DOCX, CSV*, SML (SpreadsheetML XML file)*, HTM, TXT. *These output formats are suitable for table-oriented pages that can be mapped onto a spreadsheet format |
/ExtractImages | Whether to convert the images in a PDF document to TIFF or not. Convert to TIFF: the pages in the PDF document are rasterized and saved as TIFF images. Native: the OCR’ed text is placed directly into a copy of the original PDF rather than creating an entirely new PDF |
/Autorotate | Detect page orientation and correct if required |
/RemoveBlankPage | Set this to true to remove blank pages from TIFF or PDF documents. A Sensitivity value must also be set (see below) |
/Sensitivity | The sensitivity, from 1 to 100. With high sensitivity, fewer blank pages are detected |
/Deskew | Rotates the image to correct its skew angle |
/AdvancedDeskew | Set this to true if you want to set the advanced deskew properties below |
/AdjustmentMode | Set the behavior regarding dimension adjustment for deskew operation |
/ForceDeskew | If turned off, the image is analyzed before rotation and the engine may choose not to rotate the image depending on the analysis result. If turned on, the image is always rotated to correct the skew angle |
/Despeckle | Removes all the groups of connected pixels with a number of pixels below the parameter. Suggested range: 1-20 |
/Workdepth | This parameter (0 – 255) defines how deeply the OCR engine will analyze a page with 255 being the deepest. For poorer quality documents, higher values can give better recognition results |
/JPEGQuality | This parameter (0 – 255) determines the compression/quality of color JPEG images in generated PDFs. 0 gives the smallest file size whilst 255 gives the best quality. The default value is 128 |
/PDFVersion | The PDF version of the generated PDF. One of: 1.4, 1.5, 1.6, 1.7, 1.7 Extension Level 3, 1.7 Extension Level 5, 1.7 Extension Level 8, PDF/A-1a, PDF/A-1b, PDF/A-2a, PDF/A-2b, PDF/A-3a, PDF/A-3b |
/ValidatePDFA | Set this flag to true if you want to validate the output PDF/A files. |
/LanguageDetection | Set this flag to true to enable the Auto Language Detection feature. The aim of this feature is to detect the most probable language of a single-language page. If at least one language is detected, recognition is performed in the first language candidate detected, and not in the language(s) set through Language or Languages. If no language is detected, recognition is performed using the language(s) set through Language or Languages. Note: This is set to true by default. |
/language | Determines the language to be used for OCR. This may be a comma-separated list for multiple languages, for example, /language=1,2 for German and French. Note: These codes are not the same as those used by the default low-code engine. See the “Extended OCR Languages” table for more information. |
/createfolders | Create an output folder if it does not exist. Default true. |
/dpi | Sets the DPI of images in the output file. Set to Auto by default; it can alternatively be set to 300, 200 or 150 to force a specific resolution. |
/nonimagepdf | Controls the treatment of non-image-only PDFs, for example, PDFs that contain some text as well as images. The options are: OCR (the document will be OCRed using the image method defined by “Image Method”); Raise Error (the task will terminate with an error, or behave as Skip if “On Error Continue” is set; this is the default); Skip (the document will not be processed); Pass Through (the file will not be processed, but a copy of the document will be made and named as if the processing had occurred). |
/noocr | Whether or not to perform OCR on the document (Yes to not perform OCR, No to perform OCR). |
/AdvancedDespeckle | Set the advanced despeckle settings. Advanced despeckle provides additional image noise-reduction features for the despeckle filter. |
/RemoveWhitePixels | By default, despeckle removes black pixels. If set to true, the despeckle will remove white pixels rather than black pixels. |
/Dilate | Despeckle removes all the groups of connected pixels with a number of pixels below the SpeckleSize parameter. Those connected pixels are not removed if their distance to a larger connected component is below this parameter. As a result, only the isolated pixels get deleted. The maximum value for this property is 20 pixels. The default value is ‘0’. |
/Binarization | Whether or not to perform binarization on the document. |
/Brightness | The brightness (higher values will make the result darker). |
/Contrast | The contrast (lower values will make the result darker). |
/SmoothingLevel | Smoothing may be useful to binarize text with a colored background in order to avoid noisy pixels (0 disables smoothing, higher values smooth more). |
/Undithering | Whether or not to use automatic undithering while processing a page. Note: Automatic undithering will be applied only if smoothing is also activated (SmoothingLevel). |
/Threshold | Sets the threshold for fixed threshold binarization (0 for automatic threshold computation). |
/RemoveLines | Whether or not to remove lines from an image (The image must be black and white). |
/HorizontalCleanX | The parameter for cleaning noisy pixels attached to the horizontal lines. |
/HorizontalCleanY | The parameter for cleaning noisy pixels attached to the horizontal lines. |
/VerticalCleanX | The parameter for cleaning noisy pixels attached to the vertical lines. |
/VerticalCleanY | The parameter for cleaning noisy pixels attached to the vertical lines. |
/HorizontalDilate | The dilate parameter that helps the detection of horizontal lines. |
/VerticalDilate | The dilate parameter that helps the detection of vertical lines. |
/HorizontalMaxGap | The maximum horizontal line gap to close. It is useful to remove broken lines. |
/VerticalMaxGap | The maximum vertical line gap to close. It is useful to remove broken lines. |
/HorizontalMaxThickness | The maximum thickness of the horizontal lines to remove. It is useful to keep vertical lines larger than this parameter. It can also be useful to keep vertical letter strokes. |
/VerticalMaxThickness | The maximum thickness of the vertical lines to remove. It is useful to keep horizontal lines larger than this parameter. It can also be useful to keep horizontal letter strokes. |
/HorizontalMinLength | The minimum length of the horizontal lines to remove. |
/VerticalMinLength | The minimum length of the vertical lines to remove. |
/RemoveDarkBorders | Removes the dark surrounding from bitonal, grayscale or color images. The dark surrounding of the image is whitened. Note: The dark border should be touching the edge of the page for this to work. |
/RemovePunchHoles | Attempts to remove punch holes from pages. Note: The punch hole algorithm can be used on images with the following minimum dimensions width: 300px, height: 100px (computed for 300 DPI). The minimum height and width can vary with the image resolution. |
/Interpolation | Interpolates the source image to the given resolution. This value (the target resolution) must be greater than the source image’s resolution. |
/InterpolationMode | Sets the interpolation mode. |
/KeepOriginalImage | Set this to true if you want to use the pre-processed image for OCR but keep the original image in the output document. The default value is ‘true’. Note: This setting will only work if ExtractImages is set to Convert to TIFF. |
/KeepDeskew | Set this to true if you want to use the deskewed image in the output document. Note: This property only applies when KeepOriginalImage is set to No. |
/KeepDespeckle | Set this to true if you want to use the despeckled image in the output document. This requires the source image to be black and white. Note: This property only applies when KeepOriginalImage is set to No. |
/KeepDarkBorderRemoval | Set this to true if you want to use the image after dark borders have been removed, in the output document. Note: This property only applies when KeepOriginalImage is set to No. |
/KeepPunchHoleRemoval | Set this to true if you want to use the image after punch holes have been removed, in the output document. Note: This property only applies when KeepOriginalImage is set to No. |
/resourcesfolder | By default, the OCR resources folder is a subfolder of the distribution/extendedocr folder. This option allows the resources to be located elsewhere if required. |
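
Several rows in the table above state numeric ranges or dependencies between options. The Python sketch below shows one way to check a set of options against those rules before running a command. It is a hypothetical pre-flight check, not something DAS provides. It assumes that option names can be compared case-insensitively (the examples use /deskew while the table lists /Deskew), that Boolean options take the text values true and false, and that the lower bound for Dilate is its documented default of 0.

```python
# Hypothetical pre-flight check; DAS itself does not ship this helper.
# Ranges are copied from the parameter table above.
RANGES = {
    "sensitivity": (1, 100),
    "workdepth": (0, 255),
    "jpegquality": (0, 255),
    "dilate": (0, 20),  # lower bound assumed from the documented default of 0
}

# Options that the table says only apply when KeepOriginalImage is off.
KEEP_OPTIONS = (
    "keepdeskew",
    "keepdespeckle",
    "keepdarkborderremoval",
    "keeppunchholeremoval",
)


def is_true(value):
    """Treat the text 'true' (any case) or the bool True as enabled."""
    return str(value).lower() == "true"


def check_options(options):
    """Return a list of problems found; an empty list means none were found."""
    opts = {name.lower(): value for name, value in options.items()}
    problems = []

    if str(opts.get("ocrengine")) != "1":
        problems.append("ocrengine must be 1 for the Extended OCR module")

    for name, (low, high) in RANGES.items():
        if name not in opts:
            continue
        try:
            number = int(opts[name])
        except (TypeError, ValueError):
            problems.append(f"{name} must be a whole number")
            continue
        if not low <= number <= high:
            problems.append(f"{name} must be between {low} and {high}")

    if is_true(opts.get("removeblankpage")) and "sensitivity" not in opts:
        problems.append("removeblankpage needs a sensitivity value")

    # KeepOriginalImage defaults to true according to the table.
    if is_true(opts.get("keeporiginalimage", "true")):
        for name in KEEP_OPTIONS:
            if name in opts:
                problems.append(f"{name} only applies when keeporiginalimage is false")

    return problems


for problem in check_options({"ocrengine": 1, "JPEGQuality": 300, "KeepDeskew": "true"}):
    print(problem)
```

The check only covers rules that the table states explicitly; the suggested range for Despeckle (1-20) is a recommendation rather than a limit, so it is left out.
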
Extended OCR Languages
Member Name | Value | Description |
---|---|---|
English | 0 | English (American) |
German | 1 | |
French | 2 | |
Spanish | 3 | |
Italian | 4 | |
British | 5 | |
Swedish | 6 | |
Danish | 7 | |
Norwegian | 8 | |
Dutch | 9 | |
Portuguese | 10 | |
Brazilian | 11 | |
Galician | 12 | |
Icelandic | 13 | |
Greek | 14 | |
Czech | 15 | |
Hungarian | 16 | |
Polish | 17 | |
Romanian | 18 | |
Slovak | 19 | |
Croatian | 20 | |
Serbian | 21 | |
Slovenian | 22 | |
Luxembourgish | 23 | |
Finnish | 24 | |
Turkish | 25 | |
Russian | 26 | |
Byelorussian | 27 | |
Ukrainian | 28 | |
Macedonian | 29 | |
Bulgarian | 30 | |
Estonian | 31 | |
Lithuanian | 32 | |
Afrikaans | 33 | |
Albanian | 34 | |
Catalan | 35 | |
Irish Gaelic | 36 | |
Scottish Gaelic | 37 | |
Basque | 38 | |
Breton | 39 | |
Corsican | 40 | |
Frisian | 41 | |
Nynorsk | 42 | |
Indonesian | 43 | |
Malay | 44 | |
Swahili | 45 | |
Tagalog | 46 | |
Japanese | 47 | |
Korean | 48 | |
Schinese | 49 | Simplified Chinese |
Tchinese | 50 | Traditional Chinese |
Quecha | 51 | |
Aymara | 52 | |
Faroese | 53 | |
Friulian | 54 | |
Greenlandic | 55 | |
Haitian_Creole | 56 | |
Rhaeto_Roman | 57 | |
Sardinian | 58 | |
Kurdish | 59 | |
Cebuano | 60 | |
Bemba | 61 | |
Chamorro | 62 | |
Fijan | 63 | |
Ganda | 64 | |
Hani | 65 | |
Ido | 66 | |
Interlingua | 67 | |
Kicongo | 68 | |
Kinyarwanda | 69 | |
Malagasy | 70 | |
Maori | 71 | |
Mayan | 72 | |
Minangkabau | 73 | |
Nahuatl | 74 | |
Nyanja | 75 | |
Rundi | 76 | |
Samoan | 77 | |
Shona | 78 | |
Somali | 79 | |
Sotho | 80 | |
Sundanese | 81 | |
Tahitian | 82 | |
Tonga | 83 | |
Tswana | 84 | |
Wolof | 85 | |
Xhosa | 86 | |
Zapotec | 87 | |
Javanese | 88 | |
Pidgin_Nigeria | 89 | |
Occitan | 90 | |
Manx | 91 | |
Tok_Pisin | 92 | |
Bislama | 93 | |
Hiligaynon | 94 | |
Kapampangan | 95 | |
Balinese | 96 | |
Bikol | 97 | |
Ilocano | 98 | |
Madurese | 99 | |
Waray | 100 | |
None | 101 | No language, Latin alphabet |
Serbian_Latin | 102 | |
Latin | 103 | |
Latvian | 104 | |
Hebrew | 105 | |
Numeric | 114 | |
Esperanto | 115 | |
Maltese | 116 | |
Zulu | 117 | |
Afaan | 118 | |
Asturian | 119 | |
AzeriLatin | 120 | |
Luba | 121 | |
Papamiento | 122 | |
Tatar | 123 | |
Turkmen | 124 | |
Welsh | 125 | |
Arabic | 126 | Notes: (1) You need to set ExtractImageMethod to ConvertToTiff to use the Arabic language. (2) Arabic and English: works only for Arabic texts with embedded English words. The result for a zone with only English will be empty. |
Farsi | 127 | |
Mexican | 128 | |
BosnianLatin | 129 | Bosnian (Latin). CharsetCategory.E |
BosnianCyrillic | 130 | Bosnian (Cyrillic). CharsetCategory.D |
Moldovan | 131 | Moldovan. CharsetCategory.E |
SwissGerman | 132 | German (Switzerland). CharsetCategory.C |
Tetum | 133 | Tetum. CharsetCategory.C |
Kazakh | 134 | Kazakh (Cyrillic). CharsetCategory.D |
MongolianCyrillic | 135 | Mongolian (Cyrillic). CharsetCategory.D |
UzbekLatin | 136 | Uzbek (Latin). CharsetCategory.C |
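
Because /language takes the numeric values from this table rather than language names, a small lookup can reduce mistakes when building commands. The Python sketch below is illustrative only: the `LANGUAGE_CODES` dictionary holds just a handful of rows copied from the table, and the `language_option` helper is hypothetical, not part of DAS.

```python
# A few rows copied from the Extended OCR Languages table above.
# Extend this dictionary with any other rows you need.
LANGUAGE_CODES = {
    "English": 0,
    "German": 1,
    "French": 2,
    "Spanish": 3,
    "Italian": 4,
    "British": 5,
    "Japanese": 47,
    "Arabic": 126,
}


def language_option(*names):
    """Build a /language option from one or more member names."""
    codes = [str(LANGUAGE_CODES[name]) for name in names]
    return "/language=" + ",".join(codes)


print(language_option("German", "French"))  # prints /language=1,2
```

A name that is missing from the dictionary raises a KeyError, which surfaces a misspelled language before the command is run.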