Optimize your documents with extended OCR steps

The Document Automation Server (DAS) command line interface (CLI) also supports the Extended OCR module.

The /ocrengine=1 parameter is required to use the Extended OCR module.

autobahndx.exe /operation=[operation name] /source=[tiff file or folder] /output=[output file] /target=[target folder] [/option=value]…
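The syntax above can also be assembled from a script. The following is a minimal Python sketch and not part of the product: the helper name build_args is hypothetical, and it assumes that every option follows the /name=value form shown above, with switches such as /deskew written without a value. It has not been tested against the real executable.

```python
def build_args(operation, source, target, output, **options):
    """Return an autobahndx.exe argument list for the Extended OCR module.

    Hypothetical helper: option names are passed through unchanged, so they
    must be spelled as in the documentation.
    """
    args = [
        "autobahndx.exe",
        f"/operation={operation}",
        f"/source={source}",
        f"/output={output}",
        f"/target={target}",
        "/ocrengine=1",  # required for the Extended OCR module
    ]
    for name, value in options.items():
        if value is True:
            # Bare switch, written without a value (assumption).
            args.append(f"/{name}")
        elif value is False:
            # Switches that are off are simply left out.
            continue
        elif isinstance(value, (list, tuple)):
            # Multiple values are joined with commas, e.g. pdf,docx.
            args.append(f"/{name}=" + ",".join(str(v) for v in value))
        else:
            args.append(f"/{name}={value}")
    return args


if __name__ == "__main__":
    for arg in build_args(
        "mergetifftopdf",
        r"C:\ADX Demo\In\TIFF\Folder",
        r"C:\ADX Demo\Output",
        "outfilef",
        sourcetype="folder",
        outputtype=["pdf"],
        deskew=True,
    ):
        print(arg)
```

Keeping the arguments as a list means each path stays a single argument even when it contains spaces. Whether the executable accepts arguments quoted by a launcher in this way is an assumption to verify on your installation.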

Examples:

1. Generate a searchable PDF file and a Word (DOCX) file from an image PDF file, keeping the original file name.
autobahndx.exe /source="C:\ADX Demo\In\PDF\File\US2007246939A1.pdf" /sourcetype=file /target="C:\ADX Demo\Output" /output=%FILENAME /outputtype=pdf,docx /operation=ocrimagepdf /ocrengine=1

2. Generate a searchable PDF file from a folder of TIFF and JPEG files, with Deskew and page orientation detection and correction.
autobahndx.exe /source="C:\ADX Demo\In\TIFF\Folder" /sourcetype=folder /target="C:\ADX Demo\Output" /output=outfilef /outputtype=pdf /autorotate /deskew /operation=mergetifftopdf /ocrengine=1

3. Generate searchable PDF files from image PDF files found in a folder and subfolders, while keeping the original file names.

autobahndx.exe /source="C:\ADX Demo\In\PDF\Tree" /target="C:\ADX Demo\Output" /sourcetype=tree /output=%FILENAME /outputtype=pdf /operation=ocrimagepdf /ocrengine=1
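The three examples differ mainly in their /sourcetype value (file, folder, or tree). As an illustration only, a wrapper script could choose that value from the path it is given. The function name pick_sourcetype and its recursive flag are hypothetical and are not part of the CLI.

```python
import os


def pick_sourcetype(path, recursive=False):
    """Choose a /sourcetype value (file, folder, or tree) for a source path."""
    if os.path.isfile(path):
        return "file"
    if os.path.isdir(path):
        # "tree" is the value used in example 3 for a folder and its subfolders.
        return "tree" if recursive else "folder"
    raise FileNotFoundError(path)
```

A script could then append "/sourcetype=" plus the returned value to the rest of its command line.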

The Extended OCR steps use the parameters listed in the table below.

Parameter Notes
/operation The operation that needs to be carried out:
- tifftopdf
- mergetifftopdf
- ocrimagepdf
- ocranyfileex
/ocrengine The OCR engine to use. This must be set to 1 to use the IRIS engine.
/ocrengine=1
/source Source file or folder
/sourcetype File (default), Folder, or Tree (a folder and its subfolders)
/target The Target folder
/output The output filename excluding the extension (which will be added according to the output file type)
/outputtype One or more of the following, separated by commas if more than one is required.
- RTF
- PDF
- DOCX
- CSV*
- SML (SpreadsheetML XML file)*
- HTM
- TXT
*These output formats are suitable for table-oriented pages that can be mapped onto a spreadsheet format
/ExtractImages Whether to convert the images in a PDF document to TIFF or not.
- Convert to TIFF: The pages in the PDF document are rasterized and saved as TIFF images
- Native: This method places the OCR’ed text directly into a copy of the original PDF rather than creating an entirely new PDF
/Autorotate Detect page orientation and correct if required
/RemoveBlankPage Set this to true to remove blank pages from TIFF or PDF documents. A /Sensitivity value must also be set (see below)
/Sensitivity The sensitivity, from 1 to 100. With high sensitivity, fewer blank pages are detected
/Deskew Rotates the image to correct its skew angle
/AdvancedDeskew Set this to true if you want to set the advanced deskew properties below
/AdjustmentMode Set the behavior regarding dimension adjustment for deskew operation
/ForceDeskew If turned off, the image is analyzed before rotation and the engine may choose not to rotate it, depending on the analysis result. If turned on, the image is rotated to correct the skew angle
/Despeckle Removes all groups of connected pixels that contain fewer pixels than this parameter. Suggested range: 1-20
/Workdepth This parameter (0 – 255) defines how deeply the OCR engine will analyze a page with 255 being the deepest. For poorer quality documents, higher values can give better recognition results
/JPEGQuality This parameter (0 – 255) determines the compression/quality of color JPEG images in generated PDFs. 0 gives the smallest file size whilst 255 gives the best quality. The default value is 128
/PDFVersion This determines the PDF version of the generated PDF:
- 1.4
- 1.5
- 1.6
- 1.7
- 1.7 Extension Level 3
- 1.7 Extension Level 5
- 1.7 Extension Level 8
- PDF/A-1a
- PDF/A-1b
- PDF/A-2a
- PDF/A-2b
- PDF/A-3a
- PDF/A-3b
/ValidatePDFA Set this flag to true if you want to validate the output PDF/A files.
/LanguageDetection Set this flag to true to enable the Auto Language Detection feature. The aim of this feature is to detect the most probable language of a single-language page.
If at least one language has been detected, recognition will be performed in the first language candidate that has been detected, and not in the language(s) set through Language or Languages. If it fails to detect a language, recognition will be performed using the language(s) set through Language or Languages.
Note: This is set to true by default.
/language Determines the language to be used for OCR. This may be a comma-separated list for multiple languages, for example, /language=1,2 for German and French.
Note: These codes are not the same as those used by the default low-code engine.
See the “Extended OCR Languages” table for more information.
/createfolders Create an output folder if it does not exist. Default true.
/dpi Sets the DPI of images in the output file. The default is Auto; it can instead be set to 300, 200 or 150 to force a specific resolution.
/nonimagepdf This allows control over the treatment of non-image-only PDFs, for example, PDFs that have some text in them as well as images. The options are:
- OCR: The document will be OCRed using the image method defined by “Image Method”.
- Raise Error: The task will terminate with an error. If “On Error Continue” is set, this behaves as Skip. This is the default.
- Skip: The document will not be processed.
- Pass Through: The file will not be processed, but a copy of the document will be made and named as if the processing had occurred.
/noocr Whether or not to perform OCR on the document (Yes to skip OCR, No to perform OCR).
/AdvancedDespeckle Set the advanced despeckle settings. Advanced despeckle provides additional image noise reduction features through the image despeckle filter.
/RemoveWhitePixels By default, despeckle removes black pixels. If set to true, the despeckle will remove white pixels rather than black pixels.
/Dilate Despeckle removes all groups of connected pixels that contain fewer pixels than the SpeckleSize parameter. Such a group is not removed if its distance to a larger connected component is below this parameter, so only isolated pixels are deleted. The maximum value for this property is 20 pixels. The default value is ‘0’.
/Binarization Whether or not to perform binarization on the document.
/Brightness The brightness (higher values will make the result darker).
/Contrast The contrast (lower values will make the result darker).
/SmoothingLevel Smoothing may be useful to binarize text with a colored background in order to avoid noisy pixels (0 disables smoothing, higher values smooth more).
/Undithering Whether or not to use automatic undithering while processing a page.
Note: Automatic undithering will be applied only if smoothing is also activated (SmoothingLevel).
/Threshold Sets the threshold for fixed threshold binarization (0 for automatic threshold computation).
/RemoveLines Whether or not to remove lines from an image (The image must be black and white).
/HorizontalCleanX The parameter for cleaning noisy pixels attached to the horizontal lines.
/HorizontalCleanY The parameter for cleaning noisy pixels attached to the horizontal lines.
/VerticalCleanX The parameter for cleaning noisy pixels attached to the vertical lines.
/VerticalCleanY The parameter for cleaning noisy pixels attached to the vertical lines.
/HorizontalDilate The dilate parameter that helps the detection of horizontal lines.
/VerticalDilate The dilate parameter that helps the detection of vertical lines.
/HorizontalMaxGap The maximum horizontal line gap to close. It is useful to remove broken lines.
/VerticalMaxGap The maximum vertical line gap to close. It is useful to remove broken lines.
/HorizontalMaxThickness The maximum thickness of the horizontal lines to remove. It is useful to keep vertical lines larger than this parameter. It can also be useful to keep vertical letter strokes.
/VerticalMaxThickness The maximum thickness of the vertical lines to remove. It is useful to keep horizontal lines larger than this parameter. It can also be useful to keep horizontal letter strokes.
/HorizontalMinLength The minimum length of the horizontal lines to remove.
/VerticalMinLength The minimum length of the vertical lines to remove.
/RemoveDarkBorders Removes the dark surrounding from bitonal, grayscale or color images. The dark surrounding of the image is whitened.
Note: The dark border should be touching the edge of the page for this to work.
/RemovePunchHoles Attempts to remove punch holes from pages.
Note: The punch hole algorithm can be used on images with the following minimum dimensions: width 300px, height 100px (computed for 300 DPI). The minimum height and width can vary with the image resolution.
/Interpolation Interpolates the source image to the given resolution. This value (the target resolution) must be greater than the source image’s resolution.
/InterpolationMode Sets the interpolation mode.
/KeepOriginalImage Set this to true if you want to use the pre-processed image for OCR but keep the original image in the output document. The default value is ‘true’.
Note: This setting will only work if ExtractImages is set to Convert to TIFF.
/KeepDeskew Set this to true if you want to use the deskewed image in the output document.
Note: This property only applies when KeepOriginalImage is set to No.
/KeepDespeckle Set this to true if you want to use the despeckled image in the output document. This requires the source image to be black and white.
Note: This property only applies when KeepOriginalImage is set to No.
/KeepDarkBorderRemoval Set this to true if you want the output document to use the image after dark borders have been removed.
Note: This property only applies when KeepOriginalImage is set to No.
/KeepPunchHoleRemoval Set this to true if you want the output document to use the image after punch holes have been removed.
Note: This property only applies when KeepOriginalImage is set to No.
/resourcesfolder By default, the OCR resources folder is a subfolder of the distribution/extendedocr folder. This option allows the resources to be located elsewhere if required.
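Several parameters in the table have numeric ranges or depend on other parameters. A script that builds command lines could check these before calling the executable. The sketch below is an illustration only: the names RANGES, NEEDS_ORIGINAL_OFF and check_options are hypothetical, option names must be spelled as in the table, and the lower bound of 0 for Dilate is inferred from its default value. The tool itself may enforce these rules differently.

```python
# Numeric ranges as stated in the parameter table.
RANGES = {
    "Sensitivity": (1, 100),
    "Despeckle": (1, 20),      # documented as a suggested range only
    "Workdepth": (0, 255),
    "JPEGQuality": (0, 255),
    "Dilate": (0, 20),         # maximum 20, default 0
}

# Options the table says only apply when KeepOriginalImage is off.
NEEDS_ORIGINAL_OFF = (
    "KeepDeskew",
    "KeepDespeckle",
    "KeepDarkBorderRemoval",
    "KeepPunchHoleRemoval",
)


def check_options(options):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            problems.append(f"/{name}={options[name]} is outside {low}-{high}")
    if options.get("RemoveBlankPage") and "Sensitivity" not in options:
        problems.append("/RemoveBlankPage needs a /Sensitivity value")
    # KeepOriginalImage defaults to true according to the table.
    if options.get("KeepOriginalImage", True):
        for name in NEEDS_ORIGINAL_OFF:
            if options.get(name):
                problems.append(
                    f"/{name} has no effect while KeepOriginalImage is true"
                )
    return problems
```

For example, check_options({"JPEGQuality": 300}) reports that the value is outside 0-255, while check_options({"JPEGQuality": 128}) returns an empty list.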

Extended OCR Languages

Member Name Value Description
English 0 English (American)
German 1
French 2
Spanish 3
Italian 4
British 5
Swedish 6
Danish 7
Norwegian 8
Dutch 9
Portuguese 10
Brazilian 11
Galician 12
Icelandic 13
Greek 14
Czech 15
Hungarian 16
Polish 17
Romanian 18
Slovak 19
Croatian 20
Serbian 21
Slovenian 22
Luxembourgish 23
Finnish 24
Turkish 25
Russian 26
Byelorussian 27
Ukrainian 28
Macedonian 29
Bulgarian 30
Estonian 31
Lithuanian 32
Afrikaans 33
Albanian 34
Catalan 35
Irish Gaelic 36
Scottish Gaelic 37
Basque 38
Breton 39
Corsican 40
Frisian 41
Nynorsk 42
Indonesian 43
Malay 44
Swahili 45
Tagalog 46
Japanese 47
Korean 48
Schinese 49 Simplified Chinese
Tchinese 50 Traditional Chinese
Quecha 51
Aymara 52
Faroese 53
Friulian 54
Greenlandic 55
Haitian_Creole 56
Rhaeto_Roman 57
Sardinian 58
Kurdish 59
Cebuano 60
Bemba 61
Chamorro 62
Fijan 63
Ganda 64
Hani 65
Ido 66
Interlingua 67
Kicongo 68
Kinyarwanda 69
Malagasy 70
Maori 71
Mayan 72
Minangkabau 73
Nahuatl 74
Nyanja 75
Rundi 76
Samoan 77
Shona 78
Somali 79
Sotho 80
Sundanese 81
Tahitian 82
Tonga 83
Tswana 84
Wolof 85
Xhosa 86
Zapotec 87
Javanese 88
Pidgin_Nigeria 89
Occitan 90
Manx 91
Tok_Pisin 92
Bislama 93
Hiligaynon 94
Kapampangan 95
Balinese 96
Bikol 97
Ilocano 98
Madurese 99
Waray 100
None 101 No language, Latin alphabet
Serbian_Latin 102
Latin 103
Latvian 104
Hebrew 105
Numeric 114
Esperanto 115
Maltese 116
Zulu 117
Afaan 118
Asturian 119
AzeriLatin 120
Luba 121
Papamiento 122
Tatar 123
Turkmen 124
Welsh 125
Arabic 126 Note:
- You need to set ExtractImageMethod to ConvertToTiff to use the Arabic language.
- Arabic and English: Works only for Arabic texts with embedded English words. The result for a zone with only English will be empty.
Farsi 127
Mexican 128
BosnianLatin 129 Bosnian (Latin). CharsetCategory.E
BosnianCyrillic 130 Bosnian (Cyrillic). CharsetCategory.D
Moldovan 131 Moldovan. CharsetCategory.E
SwissGerman 132 German (Switzerland). CharsetCategory.C
Tetum 133 Tetum. CharsetCategory.C
Kazakh 134 Kazakh (Cyrillic). CharsetCategory.D
MongolianCyrillic 135 Mongolian (Cyrillic). CharsetCategory.D
UzbekLatin 136 Uzbek (Latin). CharsetCategory.C
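Because /language takes the numeric values from this table rather than language names, a small lookup can make scripts easier to read. The following Python sketch copies only a few rows of the table. The names LANGUAGE_CODES and language_option are hypothetical, and joining the codes with commas and no spaces is an assumption based on the /outputtype=pdf,docx example.

```python
# A small excerpt of the table above; extend it with the rows you need.
LANGUAGE_CODES = {
    "English": 0,
    "German": 1,
    "French": 2,
    "Spanish": 3,
    "Italian": 4,
    "British": 5,
    "Japanese": 47,
    "Arabic": 126,
}


def language_option(*names):
    """Build a /language option from member names in the excerpt."""
    unknown = [n for n in names if n not in LANGUAGE_CODES]
    if unknown:
        raise KeyError("not in the excerpt: " + ", ".join(unknown))
    return "/language=" + ",".join(str(LANGUAGE_CODES[n]) for n in names)
```

With this excerpt, language_option("German", "French") returns the string /language=1,2.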