Overload | Description |
---|---|
GetPageTextWithCoords(String) | Returns all text on the current page of the loaded PDF document, whether visible or hidden, together with its text properties: the bounding box coordinates, the font information, the text mode and the text size. The extracted text is split into words, and each word is written on its own line together with its text and font properties. A space character between words is also treated as a word; two or more consecutive spaces count as a single word. The line for one word is formatted as follows: top-left X + [FieldSeparator] + top-left Y + [FieldSeparator] + top-right X + [FieldSeparator] + top-right Y + [FieldSeparator] + bottom-right X + [FieldSeparator] + bottom-right Y + [FieldSeparator] + bottom-left X + [FieldSeparator] + bottom-left Y + [FieldSeparator] + extracted word + [FieldSeparator] + font name + [FieldSeparator] + font box height + [FieldSeparator] + text mode + [FieldSeparator] + text size + EOL. Here X and Y are the horizontal and vertical coordinates of the four corner points of the rendering area, that is, the rectangle on the page where the extracted word is actually rendered. You can use these coordinates to calculate the dimensions of this area and the text rotation angle; for more details, refer to the second example below. If the text on the current page is rotated at various angles, the GuessPageTextRotation method may also be useful. The result for the current page should contain exactly as many lines as there are words on that page, including the space-words. |
GetPageTextWithCoords(String,TextExtractionOutputInfo) | Returns information about the text extracted from the current page of the loaded PDF document, whether visible or hidden: the bounding box coordinates, the font information, the text mode, the text size, the glyph widths and the glyph character representations. The extracted text is split into words. This overload lets you include or exclude each piece of information to suit your use case. Each word is written on its own line together with its text and font properties and the widths of its individual characters. A space character between words is also treated as a word; two or more consecutive spaces count as a single word. When all flags are set, the line for one word is formatted as follows: top-left X + [FieldSeparator] + top-left Y + [FieldSeparator] + top-right X + [FieldSeparator] + top-right Y + [FieldSeparator] + bottom-right X + [FieldSeparator] + bottom-right Y + [FieldSeparator] + bottom-left X + [FieldSeparator] + bottom-left Y + [FieldSeparator] + extracted word + [FieldSeparator] + font name + [FieldSeparator] + font box height + [FieldSeparator] + text mode + [FieldSeparator] + text size + [FieldSeparator] + array of widths of each glyph of the extracted word, delimited by the [FieldSeparator] + array of character representations of each glyph of the extracted word, delimited by the [FieldSeparator] + EOL. Here X and Y are the horizontal and vertical coordinates of the four corner points of the rendering area, that is, the rectangle on the page where the extracted word is actually rendered. You can use these coordinates to calculate the dimensions of this area, the coordinates of the individual characters and the text rotation angle. If the text on the current page is rotated at various angles, the GuessPageTextRotation method may also be useful. The result for the current page should contain exactly as many lines as there are words on that page, including the space-words. |
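
The line format described for GetPageTextWithCoords(String) can be consumed by splitting each line on the field separator. The following is a minimal sketch in Python that does not call the library; it only parses a string already shaped like the 13-field output described above. The separator `~`, the sample values, the dot as decimal mark, and the names `Word` and `parse_words` are assumptions made for illustration, and the sign of the computed angle depends on the coordinate origin configured for the document.

```python
import math
from dataclasses import dataclass


@dataclass
class Word:
    # Corner points in the documented order:
    # top-left, top-right, bottom-right, bottom-left.
    corners: list
    text: str
    font_name: str
    font_box_height: float
    text_mode: str  # kept as a string; its value format is not assumed here
    text_size: float

    @property
    def width(self):
        # Length of the top edge (top-left to top-right).
        return math.dist(self.corners[0], self.corners[1])

    @property
    def height(self):
        # Length of the left edge (top-left to bottom-left).
        return math.dist(self.corners[0], self.corners[3])

    @property
    def angle_degrees(self):
        # Direction of the top edge relative to the X axis.
        (x0, y0), (x1, y1) = self.corners[0], self.corners[1]
        return math.degrees(math.atan2(y1 - y0, x1 - x0))


def parse_words(output, sep="~"):
    """Parse text shaped like the 13-field per-word lines described above."""
    words = []
    for line in output.splitlines():
        if not line:
            continue  # skip a trailing empty line, but never strip: a word may be a space
        fields = line.split(sep)
        if len(fields) != 13:
            raise ValueError(f"expected 13 fields, got {len(fields)}: {line!r}")
        numbers = [float(value) for value in fields[:8]]
        corners = list(zip(numbers[0::2], numbers[1::2]))
        words.append(
            Word(
                corners=corners,
                text=fields[8],
                font_name=fields[9],
                font_box_height=float(fields[10]),
                text_mode=fields[11],
                text_size=float(fields[12]),
            )
        )
    return words


# Hypothetical sample: the word "Hello" followed by a space-word.
sample = (
    "10~20~50~20~50~32~10~32~Hello~Arial~12~0~11\n"
    "50~20~53~20~53~32~50~32~ ~Arial~12~0~11\n"
)
words = parse_words(sample)
for word in words:
    print(repr(word.text), word.width, word.height, word.angle_degrees)
```

The sketch rejects lines that do not have exactly 13 fields, so the chosen separator should be a string that cannot occur in the page text or font names. Output from the second overload has a variable number of fields, because the glyph arrays are appended and individual items can be excluded, so it would need a different field count check.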