GetPageTextWithCoords(String) Method
Returns all text on the current page of the loaded PDF document, whether visible or hidden, together with its text properties: the bounding box coordinates, the font information, the text mode and the text size. The extracted text is split into words, and each word is written on a separate line together with its text and font properties. A space character between words is also treated as a word; two or more consecutive spaces are treated as a single word. The resulting string for one word is formatted as follows:
the horizontal (X) coordinate of the top left point of the rendering area + [FieldSeparator] +
the vertical (Y) coordinate of the top left point of the rendering area + [FieldSeparator] +
the horizontal (X) coordinate of the top right point of the rendering area + [FieldSeparator] +
the vertical (Y) coordinate of the top right point of the rendering area + [FieldSeparator] +
the horizontal (X) coordinate of the bottom right point of the rendering area + [FieldSeparator] +
the vertical (Y) coordinate of the bottom right point of the rendering area + [FieldSeparator] +
the horizontal (X) coordinate of the bottom left point of the rendering area + [FieldSeparator] +
the vertical (Y) coordinate of the bottom left point of the rendering area + [FieldSeparator] +
extracted word + [FieldSeparator] +
font name + [FieldSeparator] +
font box height + [FieldSeparator] +
text mode + [FieldSeparator] +
text size + EOL
The rendering area is the rectangle on the page where the extracted word is actually rendered. You can use the provided coordinates to calculate the dimensions of this area and the text rotation angle; for details, refer to the second example below. If the text on the current page is rotated at various angles, the GdPicturePDF.GuessPageTextRotation method may also be useful.
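As a minimal sketch of that calculation, the helper below derives the width, height and rotation angle of one rendering area from its eight coordinates. The RenderingArea class and its method names are hypothetical and are not part of GdPicture. The sign and zero direction of the angle depend on the page coordinate origin, which is an assumption here, so treat the angle as relative to the chosen axis.

```csharp
using System;

// Hypothetical helper, not a GdPicture type.
public static class RenderingArea
{
    // Euclidean distance between two points.
    private static double Distance(double x1, double y1, double x2, double y2)
    {
        double dx = x2 - x1;
        double dy = y2 - y1;
        return Math.Sqrt(dx * dx + dy * dy);
    }

    // c holds the eight coordinates in the documented order:
    // top left, top right, bottom right, bottom left (X, then Y, for each point).
    public static double Width(double[] c)
    {
        return Distance(c[0], c[1], c[2], c[3]); // top left to top right
    }

    public static double Height(double[] c)
    {
        return Distance(c[0], c[1], c[6], c[7]); // top left to bottom left
    }

    // Angle of the top edge relative to the positive X axis, in degrees.
    public static double AngleDegrees(double[] c)
    {
        return Math.Atan2(c[3] - c[1], c[2] - c[0]) * (180.0 / Math.PI);
    }
}
```

Measuring the edges as distances between corner points keeps the width and height correct for rotated text, where the differences in X or Y alone would not.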
The result for the current page should contain exactly as many lines as there are words, including space-words, in the text on that page.
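The line format described above can be split back into fields with plain string handling. The following is a sketch only: WordEntry and PageTextParser are hypothetical names, not GdPicture types. It assumes that the coordinates use a period as the decimal separator and that lines end with either CR LF or LF. The last three fields are kept as strings because their exact formatting is not specified here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical container for one output line; not a GdPicture type.
public sealed class WordEntry
{
    public double[] Corners = new double[8]; // TL, TR, BR, BL as X,Y pairs
    public string Word = "";
    public string FontName = "";
    public string FontBoxHeight = "";
    public string TextMode = "";
    public string TextSize = "";
}

public static class PageTextParser
{
    public static List<WordEntry> Parse(string pageText, string separator)
    {
        var entries = new List<WordEntry>();
        string[] lines = pageText.Split(new[] { "\r\n", "\n" },
                                        StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            string[] f = line.Split(new[] { separator }, StringSplitOptions.None);
            if (f.Length < 13) continue; // not a complete record

            var entry = new WordEntry();
            bool ok = true;
            for (int i = 0; i < 8 && ok; i++)
            {
                // Assumption: invariant-culture numbers (period as decimal separator).
                ok = double.TryParse(f[i], NumberStyles.Float,
                                     CultureInfo.InvariantCulture, out entry.Corners[i]);
            }
            if (!ok) continue;

            // The last four fields are the font properties; everything between
            // them and the coordinates is the word, even if it contains the separator.
            int n = f.Length;
            entry.Word = string.Join(separator, f, 8, n - 12);
            entry.FontName = f[n - 4];
            entry.FontBoxHeight = f[n - 3];
            entry.TextMode = f[n - 2];
            entry.TextSize = f[n - 1];
            entries.Add(entry);
        }
        return entries;
    }
}
```

If every line parses, the number of entries returned should equal the number of words on the page, space-words included. Choosing a separator that is unlikely to occur in the page text or in font names keeps the split unambiguous.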
Syntax
'Declaration
Public Overloads Function GetPageTextWithCoords( _
   ByVal FieldSeparator As String _
) As String
public string GetPageTextWithCoords(
   string FieldSeparator
)
public function GetPageTextWithCoords(
   FieldSeparator: String
): String;
public function GetPageTextWithCoords(
   FieldSeparator : String
) : String;
public: string* GetPageTextWithCoords(
   string* FieldSeparator
)
public:
String^ GetPageTextWithCoords(
   String^ FieldSeparator
)
Parameters
- FieldSeparator
- The string used to delimit the fields listed above in the resulting text.
Return Value
The whole page text, one word per line, including the text coordinates and properties in the format described above. The GdPicturePDF.GetStat method can subsequently be used to determine whether this method has succeeded.
Example
The first example shows you how to extract the whole text of the PDF document with its coordinates and other properties to a text file.
The second example demonstrates how you can use the provided coordinates to calculate dimensions of rendering areas and possible text rotation angles.
How to extract the whole text of the PDF document with its coordinates and other properties to a text file. The resulting strings for the individual pages are separated by a line of text that includes the page number.
Dim caption As String = "Example: GetPageTextWithCoords"
Dim gdpicturePDF As New GdPicturePDF()
Dim status As GdPictureStatus = gdpicturePDF.LoadFromFile("test.pdf", False)
If status = GdPictureStatus.OK Then
    Dim text_file As New System.IO.StreamWriter("text_with_coord.txt")
    Dim pageCount As Integer = gdpicturePDF.GetPageCount()
    status = gdpicturePDF.GetStat()
    If status = GdPictureStatus.OK Then
        Dim text As String = ""
        Dim message As String = Nothing
        For i As Integer = 1 To pageCount
            status = gdpicturePDF.SelectPage(i)
            If status = GdPictureStatus.OK Then
                message = "Page: " + i.ToString() + " Status: " + status.ToString()
                text_file.WriteLine(message)
                'You can use your own separator here.
                text = gdpicturePDF.GetPageTextWithCoords("---")
                status = gdpicturePDF.GetStat()
                If status = GdPictureStatus.OK Then
                    text_file.WriteLine(text)
                Else
                    MessageBox.Show("The GetPageTextWithCoords() method has failed with the status: " + status.ToString(), caption)
                End If
            Else
                MessageBox.Show("The SelectPage() method has failed with the status: " + status.ToString(), caption)
            End If
        Next
    Else
        MessageBox.Show("The GetPageCount() method has failed with the status: " + status.ToString(), caption)
    End If
    text_file.Close()
Else
    MessageBox.Show("The file can't be loaded.", caption)
End If
MessageBox.Show("Searching finished.", caption)
gdpicturePDF.Dispose()
string caption = "Example: GetPageTextWithCoords";
GdPicturePDF gdpicturePDF = new GdPicturePDF();
GdPictureStatus status = gdpicturePDF.LoadFromFile("test.pdf", false);
if (status == GdPictureStatus.OK)
{
    System.IO.StreamWriter text_file = new System.IO.StreamWriter("text_with_coord.txt");
    int pageCount = gdpicturePDF.GetPageCount();
    status = gdpicturePDF.GetStat();
    if (status == GdPictureStatus.OK)
    {
        string text = "";
        string message = null;
        for (int i = 1; i <= pageCount; i++)
        {
            status = gdpicturePDF.SelectPage(i);
            if (status == GdPictureStatus.OK)
            {
                message = "Page: " + i.ToString() + " Status: " + status.ToString();
                text_file.WriteLine(message);
                //You can use your own separator here.
                text = gdpicturePDF.GetPageTextWithCoords("---");
                status = gdpicturePDF.GetStat();
                if (status == GdPictureStatus.OK)
                {
                    text_file.WriteLine(text);
                }
                else
                {
                    MessageBox.Show("The GetPageTextWithCoords() method has failed with the status: " + status.ToString(), caption);
                }
            }
            else
            {
                MessageBox.Show("The SelectPage() method has failed with the status: " + status.ToString(), caption);
            }
        }
    }
    else
    {
        MessageBox.Show("The GetPageCount() method has failed with the status: " + status.ToString(), caption);
    }
    text_file.Close();
}
else
{
    MessageBox.Show("The file can't be loaded.", caption);
}
MessageBox.Show("Searching finished.", caption);
gdpicturePDF.Dispose();
How to calculate the dimensions of the rendering area for the first word and how to determine the rotation angle if the first word is rotated.
Dim caption As String = "Example: GetPageTextWithCoords"
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    If gdpicturePDF.LoadFromFile("test.pdf", False) = GdPictureStatus.OK Then
        gdpicturePDF.SelectPage(1)
        Dim text As String = gdpicturePDF.GetPageTextWithCoords("~")
        If gdpicturePDF.GetStat() = GdPictureStatus.OK Then
            'Splitting on a Char literal; only the first eight fields (the coordinates of the first word) are used below.
            Dim coord As String() = text.Split("~"c)
            'Considering only the first word as an example. Let's assume the text is rotated.
            'Calculating the vector to determine the height of the rendering area for the first word.
            Dim vectorXH As Double = Double.Parse(coord(0)) - Double.Parse(coord(6))
            Dim vectorYH As Double = Double.Parse(coord(7)) - Double.Parse(coord(1))
            'Calculating the height of the area.
            Dim areaHeight As Double = Math.Sqrt(vectorXH * vectorXH + vectorYH * vectorYH)
            'Calculating the vector to determine the width of the rendering area for the first word.
            Dim vectorXW As Double = Double.Parse(coord(6)) - Double.Parse(coord(4))
            Dim vectorYW As Double = Double.Parse(coord(7)) - Double.Parse(coord(5))
            'Calculating the width of the area.
            Dim areaWidth As Double = Math.Sqrt(vectorXW * vectorXW + vectorYW * vectorYW)
            'Calculating the text rotation angle.
            Dim angle As Double = Math.Atan2(vectorXH, vectorYH) * (180 / Math.PI)
            'Be aware that the resulting angle is relative to the chosen base axis.
            'Continue...
        Else
            MessageBox.Show("The GetPageTextWithCoords() method has failed with the status: " + gdpicturePDF.GetStat().ToString(), caption)
        End If
    Else
        MessageBox.Show("The file can't be loaded.", caption)
    End If
End Using
string caption = "Example: GetPageTextWithCoords";
using (GdPicturePDF gdpicturePDF = new GdPicturePDF())
{
    if (gdpicturePDF.LoadFromFile("test.pdf", false) == GdPictureStatus.OK)
    {
        gdpicturePDF.SelectPage(1);
        string text = gdpicturePDF.GetPageTextWithCoords("~");
        if (gdpicturePDF.GetStat() == GdPictureStatus.OK)
        {
            //Only the first eight fields (the coordinates of the first word) are used below.
            string[] coord = text.Split('~');
            //Considering only the first word as an example. Let's assume the text is rotated.
            //Calculating the vector to determine the height of the rendering area for the first word.
            double vectorXH = double.Parse(coord[0]) - double.Parse(coord[6]);
            double vectorYH = double.Parse(coord[7]) - double.Parse(coord[1]);
            //Calculating the height of the area.
            double boxHeight = Math.Sqrt(vectorXH * vectorXH + vectorYH * vectorYH);
            //Calculating the vector to determine the width of the rendering area for the first word.
            double vectorXW = double.Parse(coord[6]) - double.Parse(coord[4]);
            double vectorYW = double.Parse(coord[7]) - double.Parse(coord[5]);
            //Calculating the width of the area.
            double boxWidth = Math.Sqrt(vectorXW * vectorXW + vectorYW * vectorYW);
            //Calculating the text rotation angle.
            double angle = Math.Atan2(vectorXH, vectorYH) * (180 / Math.PI);
            //Be aware that the resulting angle is relative to the chosen base axis.
            //Continue...
        }
        else
        {
            MessageBox.Show("The GetPageTextWithCoords() method has failed with the status: " + gdpicturePDF.GetStat().ToString(), caption);
        }
    }
    else
    {
        MessageBox.Show("The file can't be loaded.", caption);
    }
}
See Also