Optimize your documents with extended OCR steps

The Document Automation Server (DAS) command-line interface (CLI) also supports the Extended OCR module.

The /ocrengine=1 parameter is required.

autobahndx.exe /operation=[operation name] /source=[tiff file or folder] /output=[output file] /target=[target folder] [/option=value]…
Examples:
1. Generate a searchable PDF and a Word (DOCX) file from an image PDF file, keeping the original file name.
autobahndx.exe /source="C:\ADX Demo\In\PDF\File\US2007246939A1.pdf" /sourcetype=file /target="C:\ADX Demo\Output" /output=%FILENAME /outputtype=pdf,docx /operation=ocrimagepdf /ocrengine=1
2. Generate a searchable PDF file from a folder of TIFF and JPEG files, with deskew and with page orientation detection and correction.
autobahndx.exe /source="C:\ADX Demo\In\TIFF\Folder" /sourcetype=folder /target="C:\ADX Demo\Output" /output=outfilef /outputtype=pdf /autorotate /deskew /operation=mergetifftopdf /ocrengine=1
3. Generate searchable PDF files from image PDF files found in a folder and subfolders, while keeping the original file names.
autobahndx.exe /source="C:\ADX Demo\In\PDF\Tree" /target="C:\ADX Demo\Output" /sourcetype=tree /output=%FILENAME /outputtype=pdf /operation=ocrimagepdf /ocrengine=1
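The examples above all follow the same pattern, so a command line can be assembled from a set of options. The following is a minimal Python sketch, not part of DAS: build_command is a hypothetical helper, it only builds the command string and does not run it, and writing true options as bare flags is an assumption taken from example 2 (/autorotate /deskew).

```python
def build_command(operation, source, target, output, **options):
    """Return an autobahndx.exe command line as a single string.

    Hypothetical helper for illustration only; it is not part of DAS.
    """
    def fmt(name, value):
        # Assumption: boolean options are passed as bare flags, as in example 2.
        if value is True:
            return f"/{name}"
        text = str(value)
        # Quote values containing spaces, matching the style of the examples.
        if " " in text:
            text = f'"{text}"'
        return f"/{name}={text}"

    params = {
        "source": source,
        "target": target,
        "output": output,
        "operation": operation,
        **options,
        "ocrengine": 1,  # required for the Extended OCR module
    }
    return " ".join(["autobahndx.exe"] + [fmt(k, v) for k, v in params.items()])


# Rebuild example 2 (the parameter order differs, the parameters are the same).
cmd = build_command(
    "mergetifftopdf",
    source=r"C:\ADX Demo\In\TIFF\Folder",
    target=r"C:\ADX Demo\Output",
    output="outfilef",
    sourcetype="folder",
    outputtype="pdf",
    autorotate=True,
    deskew=True,
)
print(cmd)
```

Forcing ocrengine to 1 inside the helper means the required parameter cannot be forgotten by the caller.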

The Extended OCR steps use the parameters listed in the table below.

Parameter | Notes
/operation | The operation to carry out:
- tifftopdf
- mergetifftopdf
- ocrimagepdf
- ocranyfileex
/ocrengine | The OCR engine to use. This must be set to 1 to use the IRIS engine:
/ocrengine=1
/source | The source file or folder.
/sourcetype | File (default) or Folder. Example 3 above also uses tree to process a folder and its subfolders.
/target | The target folder.
/output | The output file name without the extension (the extension is added according to the output file type).
/outputtype | One or more of the following, separated by commas if more than one is required:
- RTF
- PDF
- DOCX
- CSV*
- SML (SpreadsheetML XML file)*
- HTM
- TXT
*These output formats are suitable for table-oriented pages that can be mapped onto a spreadsheet format.
/ExtractImages | Whether to convert the images in a PDF document to TIFF:
- Convert to TIFF: The pages in the PDF document are rasterized and saved as TIFF images.
- Native: The OCRed text is placed directly into a copy of the original PDF rather than into an entirely new PDF.
/Autorotate | Detects the page orientation and corrects it if required.
/RemoveBlankPage | Set this to true to remove blank pages from TIFF or PDF documents. A /Sensitivity value must also be set (see below).
/Sensitivity | The sensitivity, from 1 to 100. With high sensitivity, fewer blank pages are detected.
/Deskew | Rotates the image to correct its skew angle.
/AdvancedDeskew | Set this to true to set the advanced deskew properties below.
/AdjustmentMode | Sets how the image dimensions are adjusted during the deskew operation.
/ForceDeskew | If turned off, the image is analyzed before rotation, and the engine may choose not to rotate it depending on the analysis result. If turned on, the image is rotated to correct the skew angle.
/Despeckle | Removes every group of connected pixels that contains fewer pixels than this value. Suggested range: 1-20.
/Workdepth | This parameter (0 to 255) defines how deeply the OCR engine analyzes a page, with 255 being the deepest. For poorer quality documents, higher values can give better recognition results.
/JPEGQuality | This parameter (0 to 255) determines the compression and quality of color JPEG images in generated PDFs. 0 gives the smallest file size, while 255 gives the best quality. The default value is 128.
/PDFVersion | The PDF version of the generated PDF:
- 1.4
- 1.5
- 1.6
- 1.7
- 1.7 Extension Level 3
- 1.7 Extension Level 5
- 1.7 Extension Level 8
- PDF/A-1a
- PDF/A-1b
- PDF/A-2a
- PDF/A-2b
- PDF/A-3a
- PDF/A-3b
/ValidatePDFA | Set this flag to true to validate the output PDF/A files.
/LanguageDetection | Set this flag to true to enable the Auto Language Detection feature, which detects the most probable language of a single-language page.
If at least one language is detected, recognition is performed in the first detected candidate, not in the language(s) set through Language or Languages. If no language is detected, recognition is performed using the language(s) set through Language or Languages.
Note: This is set to true by default.
/language | The language to use for OCR. This may be a comma-separated list for multiple languages, for example, /language=1,2 for German and French.
Note: These codes are not the same as those used by the default low-code engine.
See the "Extended OCR Languages" table for more information.
/createfolders | Creates the output folder if it does not exist. The default is true.
/dpi | Sets the DPI of images in the output file. The default is Auto; alternatively, set it to 300, 200 or 150 to force a specific resolution.
/nonimagepdf | Controls the treatment of PDFs that are not image-only, for example, PDFs that contain some text as well as images. The options are:
- OCR: The document will be OCRed using the image method defined by “Image Method”.
- Raise Error: The task terminates with an error. If “On Error Continue” is set, this behaves as Skip. This is the default.
- Skip: The document is not processed.
- Pass Through: The file is not processed, but a copy of the document is made and named as if the processing had occurred.
/noocr | Whether or not to perform OCR on the document (Yes to skip OCR, No to perform OCR).
/AdvancedDespeckle | Sets the advanced despeckle settings. Advanced despeckle provides additional image noise reduction through the image despeckle filter.
/RemoveWhitePixels | By default, despeckle removes black pixels. If set to true, despeckle removes white pixels instead.
/Dilate | Despeckle removes all groups of connected pixels that contain fewer pixels than the SpeckleSize parameter. Those connected pixels are not removed if their distance to a larger connected component is below this parameter, so only isolated pixels are deleted. The maximum value is 20 pixels. The default value is 0.
/Binarization | Whether or not to perform binarization on the document.
/Brightness | The brightness (higher values make the result darker).
/Contrast | The contrast (lower values make the result darker).
/SmoothingLevel | Smoothing may be useful when binarizing text on a colored background, to avoid noisy pixels (0 disables smoothing, higher values smooth more).
/Undithering | Whether or not to use automatic undithering while processing a page.
Note: Automatic undithering is applied only if smoothing is also activated (SmoothingLevel).
/Threshold | Sets the threshold for fixed threshold binarization (0 for automatic threshold computation).
/RemoveLines | Whether or not to remove lines from an image (the image must be black and white).
/HorizontalCleanX | The parameter for cleaning noisy pixels attached to the horizontal lines.
/HorizontalCleanY | The parameter for cleaning noisy pixels attached to the horizontal lines.
/VerticalCleanX | The parameter for cleaning noisy pixels attached to the vertical lines.
/VerticalCleanY | The parameter for cleaning noisy pixels attached to the vertical lines.
/HorizontalDilate | The dilate parameter that helps the detection of horizontal lines.
/VerticalDilate | The dilate parameter that helps the detection of vertical lines.
/HorizontalMaxGap | The maximum horizontal line gap to close. It is useful for removing broken lines.
/VerticalMaxGap | The maximum vertical line gap to close. It is useful for removing broken lines.
/HorizontalMaxThickness | The maximum thickness of the horizontal lines to remove. It is useful to keep vertical lines larger than this parameter. It can also be useful to keep vertical letter strokes.
/VerticalMaxThickness | The maximum thickness of the vertical lines to remove. It is useful to keep horizontal lines larger than this parameter. It can also be useful to keep horizontal letter strokes.
/HorizontalMinLength | The minimum length of the horizontal lines to remove.
/VerticalMinLength | The minimum length of the vertical lines to remove.
/RemoveDarkBorders | Removes the dark surround from bitonal, grayscale or color images. The dark surround of the image is whitened.
Note: The dark border must touch the edge of the page for this to work.
/RemovePunchHoles | Attempts to remove punch holes from pages.
Note: The punch hole algorithm can be used on images with the following minimum dimensions: width 300px, height 100px (computed for 300 DPI). The minimum height and width can vary with the image resolution.
/Interpolation | Interpolates the source image to the given resolution. This value (the target resolution) must be greater than the source image's resolution.
/InterpolationMode | Sets the interpolation mode.
/KeepOriginalImage | Set this to true to use the pre-processed image for OCR but keep the original image in the output document. The default value is 'true'.
Note: This setting only works if ExtractImages is set to Convert to TIFF.
/KeepDeskew | Set this to true to use the deskewed image in the output document.
Note: This property only applies when KeepOriginalImage is set to No.
/KeepDespeckle | Set this to true to use the despeckled image in the output document. This requires the source image to be black and white.
Note: This property only applies when KeepOriginalImage is set to No.
/KeepDarkBorderRemoval | Set this to true to use the image after dark borders have been removed in the output document.
Note: This property only applies when KeepOriginalImage is set to No.
/KeepPunchHoleRemoval | Set this to true to use the image after punch holes have been removed in the output document.
Note: This property only applies when KeepOriginalImage is set to No.
/resourcesfolder | By default, the OCR resources folder is a subfolder of the distribution/extendedocr folder. This option allows the resources to be located elsewhere if required.
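Several parameters in the table have numeric limits or depend on each other, and these can be checked before a command is run. The following Python sketch illustrates one way to do that. check_options is a hypothetical helper, not part of DAS. The ranges for /Sensitivity, /Workdepth and /JPEGQuality and the maximum for /Dilate come from the table; the lower bound of 0 for /Dilate is an assumption based on its default value, and matching option names exactly as spelled in the table is also an assumption.

```python
# Limits taken from the parameter table. The lower bound for Dilate is an
# assumption (the table only states a default of 0 and a maximum of 20).
RANGES = {
    "Sensitivity": (1, 100),
    "Workdepth": (0, 255),
    "JPEGQuality": (0, 255),
    "Dilate": (0, 20),
}
DPI_CHOICES = {"auto", "150", "200", "300"}


def check_options(options):
    """Return a list of problems found in a dict of option name -> value.

    Hypothetical helper for illustration only; it is not part of DAS.
    """
    problems = []
    for name, value in options.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not (isinstance(value, int) and low <= value <= high):
                problems.append(f"/{name} must be an integer from {low} to {high}")
        elif name == "dpi" and str(value).lower() not in DPI_CHOICES:
            problems.append("/dpi must be Auto, 150, 200 or 300")
    # The table states that /RemoveBlankPage needs a sensitivity value.
    if options.get("RemoveBlankPage") and "Sensitivity" not in options:
        problems.append("/RemoveBlankPage needs a /Sensitivity value")
    return problems


print(check_options({"RemoveBlankPage": True, "JPEGQuality": 300}))
```

An empty list means no problem was found by these checks; it does not guarantee that the CLI will accept the options.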

Extended OCR Languages

Member Name | Value | Description
English | 0 | English (American)
German | 1 |
French | 2 |
Spanish | 3 |
Italian | 4 |
British | 5 |
Swedish | 6 |
Danish | 7 |
Norwegian | 8 |
Dutch | 9 |
Portuguese | 10 |
Brazilian | 11 |
Galician | 12 |
Icelandic | 13 |
Greek | 14 |
Czech | 15 |
Hungarian | 16 |
Polish | 17 |
Romanian | 18 |
Slovak | 19 |
Croatian | 20 |
Serbian | 21 |
Slovenian | 22 |
Luxembourgish | 23 |
Finnish | 24 |
Turkish | 25 |
Russian | 26 |
Byelorussian | 27 |
Ukrainian | 28 |
Macedonian | 29 |
Bulgarian | 30 |
Estonian | 31 |
Lithuanian | 32 |
Afrikaans | 33 |
Albanian | 34 |
Catalan | 35 |
Irish Gaelic | 36 |
Scottish Gaelic | 37 |
Basque | 38 |
Breton | 39 |
Corsican | 40 |
Frisian | 41 |
Nynorsk | 42 |
Indonesian | 43 |
Malay | 44 |
Swahili | 45 |
Tagalog | 46 |
Japanese | 47 |
Korean | 48 |
Schinese | 49 | Simplified Chinese
Tchinese | 50 | Traditional Chinese
Quecha | 51 |
Aymara | 52 |
Faroese | 53 |
Friulian | 54 |
Greenlandic | 55 |
Haitian_Creole | 56 |
Rhaeto_Roman | 57 |
Sardinian | 58 |
Kurdish | 59 |
Cebuano | 60 |
Bemba | 61 |
Chamorro | 62 |
Fijan | 63 |
Ganda | 64 |
Hani | 65 |
Ido | 66 |
Interlingua | 67 |
Kicongo | 68 |
Kinyarwanda | 69 |
Malagasy | 70 |
Maori | 71 |
Mayan | 72 |
Minangkabau | 73 |
Nahuatl | 74 |
Nyanja | 75 |
Rundi | 76 |
Samoan | 77 |
Shona | 78 |
Somali | 79 |
Sotho | 80 |
Sundanese | 81 |
Tahitian | 82 |
Tonga | 83 |
Tswana | 84 |
Wolof | 85 |
Xhosa | 86 |
Zapotec | 87 |
Javanese | 88 |
Pidgin_Nigeria | 89 |
Occitan | 90 |
Manx | 91 |
Tok_Pisin | 92 |
Bislama | 93 |
Hiligaynon | 94 |
Kapampangan | 95 |
Balinese | 96 |
Bikol | 97 |
Ilocano | 98 |
Madurese | 99 |
Waray | 100 |
None | 101 | No language, Latin alphabet
Serbian_Latin | 102 |
Latin | 103 |
Latvian | 104 |
Hebrew | 105 |
Numeric | 114 |
Esperanto | 115 |
Maltese | 116 |
Zulu | 117 |
Afaan | 118 |
Asturian | 119 |
AzeriLatin | 120 |
Luba | 121 |
Papamiento | 122 |
Tatar | 123 |
Turkmen | 124 |
Welsh | 125 |
Arabic | 126 | Note:
- You need to set ExtractImageMethod to ConvertToTiff to use the Arabic language.
- Arabic and English: Works only for Arabic texts with embedded English words. The result for a zone with only English will be empty.
Farsi | 127 |
Mexican | 128 |
BosnianLatin | 129 | Bosnian (Latin). CharsetCategory.E
BosnianCyrillic | 130 | Bosnian (Cyrillic). CharsetCategory.D
Moldovan | 131 | Moldovan. CharsetCategory.E
SwissGerman | 132 | German (Switzerland). CharsetCategory.C
Tetum | 133 | Tetum. CharsetCategory.C
Kazakh | 134 | Kazakh (Cyrillic). CharsetCategory.D
MongolianCyrillic | 135 | Mongolian (Cyrillic). CharsetCategory.D
UzbekLatin | 136 | Uzbek (Latin). CharsetCategory.C
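
Because the /language parameter takes the numeric values from this table rather than language names, a small lookup can reduce mistakes when building a command. The following Python sketch is a minimal illustration: LANGUAGE_CODES holds only a few rows copied from the table, and language_option is a hypothetical helper that is not part of DAS.

```python
# A few rows copied from the Extended OCR Languages table (member name -> value).
LANGUAGE_CODES = {
    "English": 0,
    "German": 1,
    "French": 2,
    "Spanish": 3,
    "Italian": 4,
    "Arabic": 126,
}


def language_option(*names):
    """Return a /language parameter for the given member names.

    Hypothetical helper for illustration only; it is not part of DAS.
    """
    codes = []
    for name in names:
        if name not in LANGUAGE_CODES:
            raise KeyError(f"unknown Extended OCR language: {name}")
        codes.append(str(LANGUAGE_CODES[name]))
    # Multiple languages are passed as a comma-separated list of values.
    return "/language=" + ",".join(codes)


print(language_option("German", "French"))  # /language=1,2
```

Raising an error for an unknown name is a design choice of this sketch, so that a misspelled language is caught before the command is run.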