Optimize your documents with extended OCR steps

The Document Automation Server (DAS) command-line interface (CLI) also supports the Extended OCR module.

Every Extended OCR command must include the /ocrengine=1 parameter.

autobahndx.exe /operation=[operation name] /source=[tiff file or folder] /output=[output file] /target=[target folder] [/option=value]…

Examples:

1. Generate a searchable PDF file and a Word (DOCX) file from an image PDF file, keeping the original file name. The output files are written to C:\ADX Demo\Output.

autobahndx.exe /source="C:\ADX Demo\In\PDF\File\US2007246939A1.pdf" /sourcetype=file /target="C:\ADX Demo\Output" /output=%FILENAME /outputtype=pdf,docx /operation=ocrimagepdf /ocrengine=1

2. Generate a searchable PDF file from a folder of TIFF and JPEG files, with deskew and page orientation detection and correction.
autobahndx.exe /source="C:\ADX Demo\In\TIFF\Folder" /sourcetype=folder /target="C:\ADX Demo\Output" /output=outfilef /outputtype=pdf /autorotate /deskew /operation=mergetifftopdf /ocrengine=1

3. Generate searchable PDF files from image PDF files found in a folder and subfolders, while keeping the original file names.

autobahndx.exe /source="C:\ADX Demo\In\PDF\Tree" /target="C:\ADX Demo\Output" /sourcetype=tree /output=%FILENAME /outputtype=pdf /operation=ocrimagepdf /ocrengine=1
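
When these commands are generated by a script rather than typed by hand, it can help to assemble the arguments in one place. The following is a minimal Python sketch, not part of the product: the helper names build_command and run_step are hypothetical, and it assumes autobahndx.exe is on the PATH and accepts an argument such as /source=C:\ADX Demo\In when the whole argument is quoted by the caller. Only parameters that appear on this page are used.

```python
import subprocess


def build_command(operation, source, target, output, **options):
    """Assemble an autobahndx.exe argument list for an Extended OCR step."""
    args = [
        "autobahndx.exe",
        f"/operation={operation}",
        f"/source={source}",
        f"/target={target}",
        f"/output={output}",
        "/ocrengine=1",  # required for the Extended OCR module
    ]
    for name, value in options.items():
        # True becomes a bare switch such as /deskew; anything else is /name=value.
        args.append(f"/{name}" if value is True else f"/{name}={value}")
    return args


def run_step(args):
    """Run the command; only useful on a machine where DAS is installed."""
    return subprocess.run(args, check=True)


# Rebuilds example 2 above as an argument list.
cmd = build_command(
    "mergetifftopdf",
    r"C:\ADX Demo\In\TIFF\Folder",
    r"C:\ADX Demo\Output",
    "outfilef",
    sourcetype="folder",
    outputtype="pdf",
    autorotate=True,
    deskew=True,
)
print(cmd)
```

Passing the list to run_step lets Python handle the quoting of paths that contain spaces, so the command does not need to be assembled as a single string.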

The Extended OCR steps use the parameters listed in the table below.

Parameter	Notes
/operation	The operation to carry out: tifftopdf, mergetifftopdf, ocrimagepdf, or ocranyfileex
/ocrengine	The OCR engine to use. This must be set to 1 to use the IRIS engine. /ocrengine=1
/source	Source file or folder
/sourcetype	File (default), Folder, or Tree (a folder and its subfolders)
/target	The Target folder
/output	The output filename excluding the extension (which will be added according to the output file type)
/outputtype	One or more of the following, separated by commas if more than one is required: RTF, PDF, DOCX, CSV*, SML* (SpreadsheetML XML file), HTM, TXT. *These output formats are suitable for table-oriented pages that can be mapped onto a spreadsheet format
/ExtractImages	Whether or not to convert the images in a PDF document to TIFF. Convert to TIFF: the pages in the PDF document are rasterized and saved as TIFF images. Native: the OCRed text is placed directly into a copy of the original PDF rather than creating an entirely new PDF
/Autorotate	Detect page orientation and correct if required
/RemoveBlankPage	Set this to true to remove blank pages from TIFF or PDF documents. A /Sensitivity value must also be set (see below)
/Sensitivity	The sensitivity, from 1 to 100. With high sensitivity, fewer blank pages are detected
/Deskew	Rotates the image to correct its skew angle
/AdvancedDeskew	Set this to true if you want to set the advanced deskew properties below
/AdjustmentMode	Set the behavior regarding dimension adjustment for deskew operation
/ForceDeskew	If turned off, the image is analyzed before rotation and the engine may choose not to rotate the image, depending on the analysis result. If turned on, the image is always rotated to correct the skew angle
/Despeckle	Removes every group of connected pixels that contains fewer pixels than this value. Suggested range: 1-20
/Workdepth	This parameter (0 – 255) defines how deeply the OCR engine will analyze a page with 255 being the deepest. For poorer quality documents, higher values can give better recognition results
/JPEGQuality	This parameter (0 – 255) determines the compression/quality of color JPEG images in generated PDFs. 0 gives the smallest file size whilst 255 gives the best quality. The default value is 128
/PDFVersion	This determines the PDF version of the generated PDF: - 1.4 - 1.5 - 1.6 - 1.7 - 1.7 Extension Level 3 - 1.7 Extension Level 5 - 1.7 Extension Level 8 - PDF/A-1a - PDF/A-1b - PDF/A-2a - PDF/A-2b - PDF/A-3a - PDF/A-3b
/ValidatePDFA	Set this flag to true if you want to validate the output PDF/A files.
/LanguageDetection	Set this flag to true to enable Auto Language Detection feature. The aim of this feature is to detect the most probable language of a single-language page. If at least one language has been detected, recognition will be performed in the first language candidate that has been detected, and not in the language(s) set through Language or Languages. If it fails to detect a language, recognition will be performed using the language(s) set through Language or Languages. Note: This is set to true by default.
/language	Determines the language to be used for OCR. This may be a comma-separated list for multiple languages, for example, /language=1,2 for German and French. Note: These codes are not the same as those used by the default low-code engine. See the "Extended OCR Languages" table for more information.
/createfolders	Create an output folder if it does not exist. Default true.
/dpi	Sets the DPI of images in the output file. Set to Auto by default, alternatively can be set to 300, 200 or 150 to force a specific resolution.
/nonimagepdf	This allows control over the treatment of non-image-only PDFs, for example, PDFs that have some text in them as well as images. The options are: OCR: the document will be OCRed using the image method defined by “Image Method”. Raise Error: the task will terminate with an error; if “On Error Continue” is set, this behaves as Skip. This is the default. Skip: the document will not be processed. Pass Through: the file will not be processed, but a copy of the document will be made and named as if the processing had occurred.
/noocr	Whether or not to perform OCR on the document (Yes to not perform OCR, No to perform OCR).
/AdvancedDespeckle	Enables the advanced despeckle settings, which provide advanced image noise reduction through the image despeckle filter.
/RemoveWhitePixels	By default, despeckle removes black pixels. If set to true, the despeckle will remove white pixels rather than black pixels.
/Dilate	Despeckle removes all the groups of connected pixels that contain fewer pixels than the SpeckleSize parameter. Such a group is not removed if its distance to a larger connected component is below this parameter. As a result, only the isolated pixels get deleted. The maximum value for this property is 20 pixels. The default value is '0'.
/Binarization	Whether or not to perform binarization on the document.
/Brightness	The brightness (higher values will make the result darker).
/Contrast	The contrast (lower values will make the result darker).
/SmoothingLevel	Smoothing may be useful to binarize text with a colored background in order to avoid noisy pixels (0 disables smoothing, higher values smooth more).
/Undithering	Whether or not to use automatic undithering while processing a page. Note: Automatic undithering will be applied only if smoothing is also activated (SmoothingLevel).
/Threshold	Sets the threshold for fixed threshold binarization (0 for automatic threshold computation).
/RemoveLines	Whether or not to remove lines from an image (The image must be black and white).
/HorizontalCleanX	The parameter for cleaning noisy pixels attached to the horizontal lines.
/HorizontalCleanY	The parameter for cleaning noisy pixels attached to the horizontal lines.
/VerticalCleanX	The parameter for cleaning noisy pixels attached to the vertical lines.
/VerticalCleanY	The parameter for cleaning noisy pixels attached to the vertical lines.
/HorizontalDilate	The dilate parameter that helps the detection of horizontal lines.
/VerticalDilate	The dilate parameter that helps the detection of vertical lines.
/HorizontalMaxGap	The maximum horizontal line gap to close. It is useful to remove broken lines.
/VerticalMaxGap	The maximum vertical line gap to close. It is useful to remove broken lines.
/HorizontalMaxThickness	The maximum thickness of the horizontal lines to remove. It is useful to keep vertical lines larger than this parameter. It can also be useful to keep vertical letter strokes.
/VerticalMaxThickness	The maximum thickness of the vertical lines to remove. It is useful to keep horizontal lines larger than this parameter. It can also be useful to keep horizontal letter strokes.
/HorizontalMinLength	The minimum length of the horizontal lines to remove.
/VerticalMinLength	The minimum length of the vertical lines to remove.
/RemoveDarkBorders	Removes the dark surrounding from bitonal, grayscale or color images. The dark surrounding of the image is whitened. Note: The dark border should be touching the edge of the page for this to work.
/RemovePunchHoles	Attempts to remove punch holes from pages. Note: The punch hole algorithm can be used on images with the following minimum dimensions width: 300px, height: 100px (computed for 300 DPI). The minimum height and width can vary with the image resolution.
/Interpolation	Interpolates the source image to the given resolution. This value (the target resolution) must be greater than the source image's resolution.
/InterpolationMode	Sets the interpolation mode.
/KeepOriginalImage	Set this to true if you want to use the pre-processed image for OCR but keep the original image in the output document. The default value is 'true'. Note: This setting will only work if ExtractImages is set to Convert to TIFF.
/KeepDeskew	Set this to true if you want to use the deskewed image in the output document. Note: This property only applies when KeepOriginalImage is set to No.
/KeepDespeckle	Set this to true if you want to use the despeckled image in the output document. This requires the source image to be black and white. Note: This property only applies when KeepOriginalImage is set to No.
/KeepDarkBorderRemoval	Set this to true if you want to use the image after dark borders have been removed, in the output document. Note: This property only applies when KeepOriginalImage is set to No.
/KeepPunchHoleRemoval	Set this to true if you want to use the image after punch holes have been removed, in the output document. Note: This property only applies when KeepOriginalImage is set to No.
/resourcesfolder	By default, the OCR resources folder is a subfolder of the distribution/extendedocr folder. This option allows the resources to be located elsewhere if required.
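
Several parameters in the table above have a documented numeric range, and /RemoveBlankPage depends on /Sensitivity. A script that generates commands could check these before launching a job. The sketch below is a hypothetical pre-flight check in Python; the names RANGES and check_ranges are illustrative only, and how the CLI itself reacts to out-of-range values is not stated on this page.

```python
# Ranges taken from the parameter table; names are lower-cased for lookup.
RANGES = {
    "sensitivity": (1, 100),
    "workdepth": (0, 255),
    "jpegquality": (0, 255),
    "dilate": (0, 20),
}


def check_ranges(options):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for name, value in options.items():
        bounds = RANGES.get(name.lower())
        if bounds is None:
            continue  # no documented numeric range for this parameter
        low, high = bounds
        if not low <= int(value) <= high:
            problems.append(f"/{name}={value} is outside {low}-{high}")
    # The table says /RemoveBlankPage needs a /Sensitivity value.
    lowered = {key.lower() for key in options}
    if "removeblankpage" in lowered and "sensitivity" not in lowered:
        problems.append("/RemoveBlankPage is set without /Sensitivity")
    return problems


print(check_ranges({"RemoveBlankPage": "true", "Workdepth": 300}))
```

The suggested range for /Despeckle (1-20) is left out on purpose, because the table presents it as a suggestion rather than a limit.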

Extended OCR Languages

Member Name	Value	Description
English	0	English (American)
German	1
French	2
Spanish	3
Italian	4
British	5
Swedish	6
Danish	7
Norwegian	8
Dutch	9
Portuguese	10
Brazilian	11
Galician	12
Icelandic	13
Greek	14
Czech	15
Hungarian	16
Polish	17
Romanian	18
Slovak	19
Croatian	20
Serbian	21
Slovenian	22
Luxembourgish	23
Finnish	24
Turkish	25
Russian	26
Byelorussian	27
Ukrainian	28
Macedonian	29
Bulgarian	30
Estonian	31
Lithuanian	32
Afrikaans	33
Albanian	34
Catalan	35
Irish Gaelic	36
Scottish Gaelic	37
Basque	38
Breton	39
Corsican	40
Frisian	41
Nynorsk	42
Indonesian	43
Malay	44
Swahili	45
Tagalog	46
Japanese	47
Korean	48
Schinese	49	Simplified Chinese
Tchinese	50	Traditional Chinese
Quecha	51
Aymara	52
Faroese	53
Friulian	54
Greenlandic	55
Haitian_Creole	56
Rhaeto_Roman	57
Sardinian	58
Kurdish	59
Cebuano	60
Bemba	61
Chamorro	62
Fijan	63
Ganda	64
Hani	65
Ido	66
Interlingua	67
Kicongo	68
Kinyarwanda	69
Malagasy	70
Maori	71
Mayan	72
Minangkabau	73
Nahuatl	74
Nyanja	75
Rundi	76
Samoan	77
Shona	78
Somali	79
Sotho	80
Sundanese	81
Tahitian	82
Tonga	83
Tswana	84
Wolof	85
Xhosa	86
Zapotec	87
Javanese	88
Pidgin_Nigeria	89
Occitan	90
Manx	91
Tok_Pisin	92
Bislama	93
Hiligaynon	94
Kapampangan	95
Balinese	96
Bikol	97
Ilocano	98
Madurese	99
Waray	100
None	101	No language, Latin alphabet
Serbian_Latin	102
Latin	103
Latvian	104
Hebrew	105
Numeric	114
Esperanto	115
Maltese	116
Zulu	117
Afaan	118
Asturian	119
AzeriLatin	120
Luba	121
Papamiento	122
Tatar	123
Turkmen	124
Welsh	125
Arabic	126	Notes: You need to set ExtractImageMethod to ConvertToTiff to use the Arabic language. Arabic and English: works only for Arabic texts with embedded English words; the result for a zone with only English will be empty.
Farsi	127
Mexican	128
BosnianLatin	129	Bosnian (Latin). CharsetCategory.E
BosnianCyrillic	130	Bosnian (Cyrillic). CharsetCategory.D
Moldovan	131	Moldovan. CharsetCategory.E
SwissGerman	132	German (Switzerland). CharsetCategory.C
Tetum	133	Tetum. CharsetCategory.C
Kazakh	134	Kazakh (Cyrillic). CharsetCategory.D
MongolianCyrillic	135	Mongolian (Cyrillic). CharsetCategory.D
UzbekLatin	136	Uzbek (Latin). CharsetCategory.C
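
Because /language takes the numeric values from this table rather than language names, a small lookup can make generated commands easier to read. The following Python sketch copies only a few rows of the table; LANGUAGE_CODES and language_option are hypothetical names used for illustration.

```python
# A small excerpt of the "Extended OCR Languages" table (member name -> value).
LANGUAGE_CODES = {
    "English": 0,
    "German": 1,
    "French": 2,
    "Spanish": 3,
    "Italian": 4,
    "British": 5,
    "Japanese": 47,
    "Arabic": 126,
}


def language_option(*names):
    """Build the /language parameter from member names in the table."""
    # An unknown name raises KeyError instead of silently passing a wrong code.
    codes = [str(LANGUAGE_CODES[name]) for name in names]
    return "/language=" + ",".join(codes)


print(language_option("German", "French"))  # /language=1,2
```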
