Extended OCR Steps

The Autobahn DX command line interface also supports the Extended OCR module.

The /ocrengine=1 parameter is required.

autobahndx.exe /operation=[operation name] /source=[tiff file or folder] /output=[output file] /target=[target folder] [/option=value]…
Examples

1. Generate a searchable PDF and a Word (DOCX) file from an image PDF file, keeping the original file name

autobahndx.exe /source="C:\ADX Demo\In\PDF\File\US2007246939A1.pdf" /sourcetype=file /target="C:\ADX Demo\Output" /output=%FILENAME /outputtype=pdf,docx /operation=ocrimagepdf /ocrengine=1

2. Generate a searchable PDF file from a folder of TIFF and JPEG files, with Deskew and page orientation detection and correction.

autobahndx.exe /source="C:\ADX Demo\In\TIFF\Folder" /sourcetype=folder /target="C:\ADX Demo\Output" /output=outfilef /outputtype=pdf /autorotate /deskew /operation=mergetifftopdf /ocrengine=1

3. Generate searchable PDF files from image PDF files found in a folder and subfolders, while keeping the original file names.

autobahndx.exe /source="C:\ADX Demo\In\PDF\Tree" /target="C:\ADX Demo\Output" /sourcetype=tree /output=%FILENAME /outputtype=pdf /operation=ocrimagepdf /ocrengine=1
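When commands like these are generated from a script, it can help to assemble the argument list programmatically. The following is a minimal Python sketch, not part of Autobahn DX: the helper name `build_command` is hypothetical, and it assumes that switch-style options such as /autorotate and /deskew are passed without a value, as in example 2 above. Passing the arguments to `subprocess.run` as a list leaves the quoting of paths that contain spaces to Python.

```python
import subprocess


def build_command(options, flags=(), exe="autobahndx.exe"):
    """Return an argument list such as ['autobahndx.exe', '/source=...', '/deskew'].

    `options` maps parameter names to values (rendered as /name=value);
    `flags` lists value-less switches (rendered as /name).
    Both the helper and its argument names are illustrative assumptions.
    """
    args = [exe]
    for name, value in options.items():
        args.append(f"/{name}={value}")
    for flag in flags:
        args.append(f"/{flag}")
    return args


# Example 2 from above, expressed as data.
command = build_command(
    {
        "source": r"C:\ADX Demo\In\TIFF\Folder",
        "sourcetype": "folder",
        "target": r"C:\ADX Demo\Output",
        "output": "outfilef",
        "outputtype": "pdf",
        "operation": "mergetifftopdf",
        "ocrengine": 1,
    },
    flags=("autorotate", "deskew"),
)

if __name__ == "__main__":
    print(command)
    # To run it on a machine where autobahndx.exe is on the PATH:
    # subprocess.run(command, check=True)
```

Keeping the options in a dictionary makes it straightforward to reuse one set of settings across several source folders by changing only the "source" and "target" entries.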

The Extended OCR steps use the parameters listed in the table below.

Parameter

Notes

/operation

The operation that needs to be carried out:

  • tifftopdf

  • mergetifftopdf

  • ocrimagepdf

  • ocranyfileex

/ocrengine

The OCR engine to use. This must be set to 1 to use the IRIS engine.

/ocrengine=1

/source

Source file or folder

/sourcetype

File (default) or Folder

/target

The target folder

/output

The output filename excluding the extension (which will be added according to the output file type).

/outputtype

One or more of the following, separated by commas if more than one is required.

RTF

PDF

DOCX

CSV*

SML (SpreadsheetML XML file)*

HTM

TXT

*These output formats are suitable for table-oriented pages that can be mapped onto a spreadsheet format.

/ExtractImages

Whether to convert the images in a PDF document to TIFF or not.

  • Convert to TIFF – The pages in the PDF document are rasterized and saved as TIFF images

  • Native – This method places the OCR’ed text directly into a copy of the original PDF rather than creating an entirely new PDF.

/Autorotate

Detect page orientation and correct it if required.

/RemoveBlankPage

Set this to true to remove blank pages from TIFF or PDF documents. A /Sensitivity value must also be set (see below).

/Sensitivity

The sensitivity, from 1 to 100. With high sensitivity, fewer blank pages are detected.

/Deskew

Rotates the image to correct its skew angle.

/AdvancedDeskew

Set this to true if you want to set the advanced deskew properties below.

/AdjustmentMode

Sets the behavior regarding dimension adjustment for the deskew operation.

/ForceDeskew

If turned off, the image is analyzed before rotation and the engine may choose not to rotate the image, depending on the analysis result. If turned on, the image is rotated to correct the skew angle.

/Despeckle

Removes all groups of connected pixels whose pixel count is below the parameter value. Suggested range: 1-20.

/Workdepth

This parameter (0 – 255) defines how deeply the OCR engine will analyze a page, with 255 being the deepest. For poorer quality documents, higher values can give better recognition results.

/JPEGQuality

This parameter (0 – 255) determines the compression/quality of color JPEG images in generated PDFs. 0 gives the smallest file size whilst 255 gives the best quality. The default value is 128.

/PDFVersion

This determines the PDF version of the generated PDF:

  • 1.4

  • 1.5

  • 1.6

  • 1.7

  • 1.7 Extension Level 3

  • 1.7 Extension Level 5

  • 1.7 Extension Level 8

  • PDF/A-1a

  • PDF/A-1b

  • PDF/A-2a

  • PDF/A-2b

  • PDF/A-3a

  • PDF/A-3b

/ValidatePDFA

Set this flag to true if you want to validate the output PDF/A files.

/LanguageDetection

Set this flag to true to enable the Auto Language Detection feature. The aim of this feature is to detect the most probable language of a single-language page.

If at least one language has been detected, recognition will be performed in the first language candidate that has been detected, and not in the language(s) set through Language or Languages. If it fails to detect a language, recognition will be performed using the language(s) set through Language or Languages.

Note: This is set to true by default.

/language

Determines the language to be used for OCR. This may be a comma-separated list for multiple languages, e.g. /language=1,2 for German and French. Note that these codes are not the same as those used by the default Aquaforest engine.

See the full table of languages and their codes in the next section.

/createfolders

Create an output folder if it does not exist. The default is true.

/dpi

Sets the DPI of images in the output file. Set to Auto by default; alternatively, it can be set to 300, 200 or 150 to force a specific resolution.

/nonimagepdf

This allows control over the treatment of non-image-only PDFs, i.e., PDFs that have some text in them as well as images. The options are:

  • OCR. The document will be OCRed using the image method defined by “Image Method”

  • Raise Error. The task will terminate with an error. If “On Error Continue” is set, this behaves as Skip. This is the default.

  • Skip. The document will not be processed.

  • Pass Through. The file will not be processed, but a copy of the document will be made and named as if the processing had occurred.

/noocr

Whether or not to perform OCR on the document (Yes to not perform OCR, No to perform OCR).

/AdvancedDespeckle

Sets the advanced despeckle settings. Advanced despeckle provides advanced image noise reduction features through the image despeckle filter.

/RemoveWhitePixels

By default, despeckle removes black pixels. If set to true, despeckle removes white pixels rather than black pixels.

/Dilate

Despeckle removes all groups of connected pixels whose pixel count is below the SpeckleSize parameter. Those connected pixels are not removed if their distance to a larger connected component is below this parameter. As a result, only isolated pixels are deleted. The maximum value for this property is 20 pixels.

The default value is '0'.

/Binarization

Whether or not to perform binarization on the document.

/Brightness

The brightness (higher values will make the result darker).

/Contrast

The contrast (lower values will make the result darker).

/SmoothingLevel

Smoothing may be useful when binarizing text with a colored background, in order to avoid noisy pixels (0 disables smoothing, higher values smooth more).

/Undithering

Whether or not to use automatic undithering while processing a page. Note: Automatic undithering will be applied only if smoothing is also activated (SmoothingLevel).

/Threshold

Sets the threshold for fixed threshold binarization (0 for automatic threshold computation).

/RemoveLines

Whether or not to remove lines from an image (the image must be black and white).

/HorizontalCleanX

The parameter for cleaning noisy pixels attached to the horizontal lines.

/HorizontalCleanY

The parameter for cleaning noisy pixels attached to the horizontal lines.

/VerticalCleanX

The parameter for cleaning noisy pixels attached to the vertical lines.

/VerticalCleanY

The parameter for cleaning noisy pixels attached to the vertical lines.

/HorizontalDilate

The dilate parameter that helps the detection of horizontal lines.

/VerticalDilate

The dilate parameter that helps the detection of vertical lines.

/HorizontalMaxGap

The maximum horizontal line gap to close. It is useful for removing broken lines.

/VerticalMaxGap

The maximum vertical line gap to close. It is useful for removing broken lines.

/HorizontalMaxThickness

The maximum thickness of the horizontal lines to remove. It is useful for keeping vertical lines larger than this parameter. It can also be useful for keeping vertical letter strokes.

/VerticalMaxThickness

The maximum thickness of the vertical lines to remove. It is useful for keeping horizontal lines larger than this parameter. It can also be useful for keeping horizontal letter strokes.

/HorizontalMinLength

The minimum length of the horizontal lines to remove.

/VerticalMinLength

The minimum length of the vertical lines to remove.

/RemoveDarkBorders

Removes the dark surround from bitonal, grayscale or color images. The dark surround of the image is whitened (Note: The dark border should be touching the edge of the page for this to work).

/RemovePunchHoles

Attempts to remove punch holes from pages. Note: The punch hole algorithm can be used on images with the following minimum dimensions: width 300px, height 100px (computed for 300 DPI). The minimum height and width can vary with the image resolution.

/Interpolation

Interpolates the source image to the given resolution. This value (the target resolution) must be greater than the source image's resolution.

/InterpolationMode

Sets the interpolation mode.

/KeepOriginalImage

Set this to true if you want to use the pre-processed image for OCR but keep the original image in the output document. The default value is 'true'.

Note: This setting will only work if ExtractImages is set to Convert to TIFF

/KeepDeskew

Set this to true if you want to use the deskewed image in the output document.

Note: This property only applies when KeepOriginalImage is set to No

/KeepDespeckle

Set this to true if you want to use the despeckled image in the output document. This requires the source image to be black and white.

Note: This property only applies when KeepOriginalImage is set to No

/KeepDarkBorderRemoval

Set this to true if you want to use the image after dark borders have been removed, in the output document.

Note: This property only applies when KeepOriginalImage is set to No

/KeepPunchHoleRemoval

Set this to true if you want to use the image after punch holes have been removed, in the output document.

Note: This property only applies when KeepOriginalImage is set to No

/resourcesfolder

By default, the OCR resources folder is a subfolder of the distribution/extendedocr folder. This option allows the resources to be located elsewhere if required.
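
Several parameters in the table above have documented numeric ranges, and a wrapper script could check values before invoking the tool. The sketch below is an illustration in Python, not a feature of Autobahn DX. The function name `check_ranges` is hypothetical, and the limits are copied from the table: the /Despeckle range is only a suggested range, and only an upper limit of 20 is stated for /Dilate, so a lower bound of 0 (its default) is assumed.

```python
# Inclusive limits taken from the parameter table.
RANGES = {
    "sensitivity": (1, 100),
    "despeckle": (1, 20),    # suggested range, not a hard limit
    "workdepth": (0, 255),
    "jpegquality": (0, 255),
    "dilate": (0, 20),       # lower bound assumed from the default of 0
}


def check_ranges(options):
    """Return a list of messages for values outside the documented ranges."""
    problems = []
    for name, value in options.items():
        limits = RANGES.get(name.lower())
        if limits is None:
            continue  # no documented numeric range for this parameter
        low, high = limits
        if not low <= int(value) <= high:
            problems.append(f"/{name}={value} is outside {low}-{high}")
    return problems


if __name__ == "__main__":
    print(check_ranges({"Sensitivity": 50, "JPEGQuality": 300}))
```

Parameters without an entry in `RANGES` are passed over rather than rejected, so the check can be applied to a complete set of options.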

Extended OCR Languages

Member name

Value

Description

English

0

English (American)

German

1

French

2

Spanish

3

Italian

4

British

5

Swedish

6

Danish

7

Norwegian

8

Dutch

9

Portuguese

10

Brazilian

11

Galician

12

Icelandic

13

Greek

14

Czech

15

Hungarian

16

Polish

17

Romanian

18

Slovak

19

Croatian

20

Serbian

21

Slovenian

22

Luxembourgish

23

Finnish

24

Turkish

25

Russian

26

Byelorussian

27

Ukrainian

28

Macedonian

29

Bulgarian

30

Estonian

31

Lithuanian

32

Afrikaans

33

Albanian

34

Catalan

35

Irish Gaelic

36

Scottish Gaelic

37

Basque

38

Breton

39

Corsican

40

Frisian

41

Nynorsk

42

Indonesian

43

Malay

44

Swahili

45

Tagalog

46

Japanese

47

Korean

48

Schinese

49

Simplified Chinese

Tchinese

50

Traditional Chinese

Quecha

51

Aymara

52

Faroese

53

Friulian

54

Greenlandic

55

Haitian_Creole

56

Rhaeto_Roman

57

Sardinian

58

Kurdish

59

Cebuano

60

Bemba

61

Chamorro

62

Fijan

63

Ganda

64

Hani

65

Ido

66

Interlingua

67

Kicongo

68

Kinyarwanda

69

Malagasy

70

Maori

71

Mayan

72

Minangkabau

73

Nahuatl

74

Nyanja

75

Rundi

76

Samoan

77

Shona

78

Somali

79

Sotho

80

Sundanese

81

Tahitian

82

Tonga

83

Tswana

84

Wolof

85

Xhosa

86

Zapotec

87

Javanese

88

Pidgin_Nigeria

89

Occitan

90

Manx

91

Tok_Pisin

92

Bislama

93

Hiligaynon

94

Kapampangan

95

Balinese

96

Bikol

97

Ilocano

98

Madurese

99

Waray

100

None

101

No language, Latin alphabet

Serbian_Latin

102

Latin

103

Latvian

104

Hebrew

105

Numeric

114

Esperanto

115

Maltese

116

Zulu

117

Afaan

118

Asturian

119

AzeriLatin

120

Luba

121

Papamiento

122

Tatar

123

Turkmen

124

Welsh

125

Arabic

126

Note:

- You need to set about:blank to about:blank to use Arabic language

- Arabic and English: Works only for Arabic texts with embedded English words. The result for a zone with only English will be empty.

Farsi

127

Mexican

128

BosnianLatin

129

Bosnian (Latin). CharsetCategory.E

BosnianCyrillic

130

Bosnian (Cyrillic). CharsetCategory.D

Moldovan

131

Moldovan. CharsetCategory.E

SwissGerman

132

German (Switzerland). CharsetCategory.C

Tetum

133

Tetum. CharsetCategory.C

Kazakh

134

Kazakh (Cyrillic). CharsetCategory.D

MongolianCyrillic

135

Mongolian (Cyrillic). CharsetCategory.D

UzbekLatin

136

Uzbek (Latin). CharsetCategory.C
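
Because /language takes the numeric values from this table rather than the names, a small lookup can make scripts easier to read. The Python sketch below is illustrative only: `language_option` is a hypothetical helper, and the dictionary holds just a handful of entries copied from the table.

```python
# A few member names and values copied from the table above.
LANGUAGE_CODES = {
    "English": 0,
    "German": 1,
    "French": 2,
    "Spanish": 3,
    "Italian": 4,
    "British": 5,
}


def language_option(*names):
    """Build a /language argument, e.g. /language=1,2 for German and French."""
    codes = [str(LANGUAGE_CODES[name]) for name in names]
    return "/language=" + ",".join(codes)


if __name__ == "__main__":
    print(language_option("German", "French"))
```

An unknown name raises a `KeyError`, which surfaces a misspelled language before the command is run.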