Export to CSV/XLSX

Screen Field / Button Description
Output file csv: Produces a simple csv file.
xlsx: Produces an Excel file.
Append Page Data to Existing File If set to true, instead of overwriting the output file, Kingfisher will append the contents to the end of the current file.
Append as WorkSheet When on, Kingfisher adds an extra worksheet for each PDF file instead of appending the data to the end of the first worksheet.
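
The difference between overwriting and appending can be illustrated with a small Python sketch. This is not Kingfisher code; the function name write_rows and its append parameter are hypothetical and only mirror the behaviour described for "Append Page Data to Existing File" when the output is a csv file.

```python
import csv


def write_rows(path, rows, append=False):
    """Write rows to a CSV file.

    With append=False the file is overwritten; with append=True the rows
    are added after whatever the file already contains.
    """
    mode = "a" if append else "w"
    # newline="" lets the csv module control line endings itself.
    with open(path, mode, newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

Calling write_rows twice on the same path with append=True leaves the rows from both calls in the file, while append=False keeps only the rows from the last call.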

Aquaforest Kingfisher can recognize tables in PDF files with minimal user intervention. The tabular data it extracts is usually written to a csv or xlsx file.

To extract all the tables recognized in the PDF file to the csv file, just load the input file and run the job. To tailor the extraction further and group the tables to your own liking, read the section below.

The Kingfisher Table control is divided into two tabs called the “Document View” and the “Table View”.

The Document View

This tab shows a visual representation of the PDF file and the tables recognized in it. The tables are usually highlighted. Below is a screenshot showing the Document View tab.

The Document View usually shows the tables recognized in the first 10 pages. This value can be changed as shown below. It also provides some properties that can be changed to alter the way the tables are recognized. These properties are described below.

Value Details
Zoom in/Out You can use this to zoom in or zoom out of the file loaded in the table viewer control.
Pages Use this control to move through the pages of the loaded file.
Reload Tables Clicking this button makes Kingfisher reload the tables in the PDF file. This allows you to change some settings and reload the tables after the file has been processed.
Crop Table Switching this on allows you to specify an area where Kingfisher should check for tables. This method produces more accurate results because it eliminates possible noise.
Do Not Write Header to Output Switching this on stops Kingfisher from writing the Group Header to the output file. This is important if you want to write similar table data from multiple files to one output file, because it preserves continuity.
Use First Table to Derive All Tables This setting instructs Kingfisher to use the first table in the file as a template to recognise all the other tables in the file.
Use Table Lines to Detect Tables If set to true, Kingfisher will use the graphic lines in the PDF to find the tables in the file.
Use Word Coordinates to Detect Tables This setting instructs Kingfisher to use the coordinates of words on a page to recognise tables.
Extract Only Grouped Tables Kingfisher lets you group tables according to different rules. Turning this setting on writes only the grouped tables to the csv file.
Number of Pages to Check The process of table recognition is CPU intensive, so Kingfisher only displays the tables from the first 10 pages in the GUI. You can increase or reduce this number using the textbox.
Minimum Table Gap (pt) If the space between two consecutive lines is greater than the value given here, that space will be interpreted as a table break.
Table Space (pt) If the space between two words is greater than this value, Kingfisher will recognise that space as a column break. Note: if the value is 0 or less, Kingfisher will ignore it and work out the space automatically.
Minimum Number of Rows Kingfisher will only recognize tables with a higher number of rows than this value.
Minimum Number of Columns Kingfisher will only recognize tables with a higher number of columns than this value.
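
To make the two spacing thresholds more concrete, here is a minimal Python sketch of gap-based splitting. It is not Kingfisher's implementation. The function names and the input shapes (words as tuples of start position, end position and text in points, lines as vertical positions in points) are assumptions made for illustration, and the automatic mode used when Table Space is 0 or less is not modelled.

```python
def split_into_cells(words, table_space):
    """Group the words of one text line into cells.

    words: list of (x_start, x_end, text) tuples, sorted left to right.
    A horizontal gap wider than table_space starts a new cell.
    """
    cells = []
    current = []
    previous_end = None
    for x_start, x_end, text in words:
        if previous_end is not None and x_start - previous_end > table_space:
            cells.append(" ".join(current))
            current = []
        current.append(text)
        previous_end = x_end
    if current:
        cells.append(" ".join(current))
    return cells


def split_into_tables(line_tops, minimum_table_gap):
    """Split text lines into tables.

    line_tops: vertical positions of the lines, sorted top to bottom.
    A vertical gap wider than minimum_table_gap starts a new table.
    """
    tables = []
    for top in line_tops:
        if tables and top - tables[-1][-1] <= minimum_table_gap:
            tables[-1].append(top)
        else:
            tables.append([top])
    return tables
```

In this sketch a larger Table Space value merges more words into the same cell, and a larger Minimum Table Gap value merges more lines into the same table.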

Table View

The Table View shows the data that was successfully extracted from the PDF file. It is divided into two parts, explained below:

  • The controls on the left are divided into two tabs named “Tables” and “Grouped Tables”. The tabs contain a list of the tables found and some other controls for grouping tables.

  • The right part contains a data grid which displays the data in the selected table or grouped table.

The next section will dive deeper into the “Table View”.

Tables

This is a straightforward tab; it contains only two controls.

Value Details
Up/Down Control Use these buttons to select the previous or next table in the list. Note that the data grid on the right shows the contents of the selected table.
The table list This contains a list of all the tables recognized in the document. Clicking on any one of them will load the selected table in the data grid on the right.

Grouped Tables

By default, Kingfisher extracts all structured data in a PDF file and writes it to a “.csv” or “.xlsx” file. However, you might not want to extract all the tables found in a document, or you might want to combine similar tables into one group with a common header.

For this reason, Kingfisher lets you group or exclude tables and edit the headers. The following sections explain this in more depth.

Grouping Criteria

This section explains how users can group various tables based on the grouping criteria offered by Kingfisher.

  • Header Rows: If you are writing a group of, say, 5 tables to a “.csv” file, the table header would be repeated for each table and the final CSV file would not be very clean. To avoid this, Kingfisher skips the first “n” rows of each table, where “n” is the value provided in the header rows control.
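
The effect of the header rows value can be sketched in Python as follows. This is only an illustration, not Kingfisher's code: the function name merge_group is hypothetical, and the sketch assumes that the header of the first table in the group is written once while the first “n” rows of every later table are skipped.

```python
def merge_group(tables, header_rows):
    """Concatenate the tables of one group into a single list of rows.

    tables: list of tables, each a list of rows (lists of strings).
    header_rows: number of leading rows treated as the repeated header.
    Assumption: the first table keeps its header, later tables drop theirs.
    """
    merged = []
    for index, table in enumerate(tables):
        merged.extend(table if index == 0 else table[header_rows:])
    return merged
```

With header_rows set to 1, two tables that both start with the row Name, Qty produce one output in which that row appears only once, at the top.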

Group by Column Count

This is a straightforward option; it groups all the tables with the same number of columns. This option also allows you to provide the header rows (see 1.2.3).
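
As a rough illustration of this criterion, the Python sketch below groups tables by their column count. It is not taken from Kingfisher: the function name, the dictionary of table IDs to rows, and the choice to use the widest row as the column count are all assumptions.

```python
from collections import defaultdict


def group_by_column_count(tables):
    """Group table IDs by the number of columns in each table.

    tables: dict mapping a table ID to its rows (lists of strings).
    Assumption: the column count of a table is the length of its widest row.
    """
    groups = defaultdict(list)
    for table_id, rows in tables.items():
        column_count = max((len(row) for row in rows), default=0)
        groups[column_count].append(table_id)
    return dict(groups)
```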

Where the rows below are the same

This criterion will check all the tables for the row specified and group the tables based on identical rows. The example below will group all the tables with the same first row as one group.

Unfortunately, you can’t save headers for this type of criterion. The table types recognized here are not very predictable and will vary across different files, so we decided to disable that feature to avoid putting wrong headers on tables.

Checking the Use Row as Header check box will use the matched row as the header row.
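
Grouping on an identical row can be pictured with the following Python sketch. It is an illustration under stated assumptions, not Kingfisher's logic: the function name and data layout are hypothetical, rows are compared cell by cell for exact equality, and tables that do not have the requested row are left out of every group.

```python
from collections import defaultdict


def group_by_row(tables, row_index):
    """Group table IDs whose row at row_index is identical.

    tables: dict mapping a table ID to its rows (lists of strings).
    Tables shorter than row_index + 1 rows are not placed in any group.
    """
    groups = defaultdict(list)
    for table_id, rows in tables.items():
        if row_index < len(rows):
            # A tuple is hashable, so the row itself can serve as the key.
            groups[tuple(rows[row_index])].append(table_id)
    return dict(groups)
```

Calling group_by_row with row_index 0 corresponds to grouping all the tables that share the same first row.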

Where the cells below are the same

This is the same as the previous option; the only difference is that it compares a single cell instead of the whole row. Unfortunately, you can’t save headers for this type of criterion. The table types recognized here are not very predictable and will vary across different files, so we decided to disable that feature to avoid putting wrong headers on tables.

Checking the Use Row as Header check box will use the matched row as the header row.

Where cells below match expression

This option groups all the tables where the cells provided below match the Regular Expression given. Note that you can use the ‘+’ button to add more items.

This option also allows you to provide the header rows (see 1.2.3).

Checking the Use Row as Header check box will use the matched row as the header row.
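
A cell-based expression match could look like the Python sketch below. It is not Kingfisher's implementation and makes several assumptions: each item added with the ‘+’ button is modelled as a (row, column, pattern) rule, a table is selected only when all rules match, and a pattern may match anywhere in the cell text (re.search) unless it is anchored. The function names are hypothetical.

```python
import re


def cell_matches(rows, row_index, column_index, pattern):
    """Return True if the addressed cell exists and matches the pattern."""
    try:
        cell = rows[row_index][column_index]
    except IndexError:
        return False
    return re.search(pattern, cell) is not None


def group_by_cell_patterns(tables, rules):
    """Return the IDs of the tables for which every rule matches.

    tables: dict mapping a table ID to its rows (lists of strings).
    rules: list of (row_index, column_index, pattern) tuples.
    """
    return [
        table_id
        for table_id, rows in tables.items()
        if all(cell_matches(rows, r, c, p) for r, c, p in rules)
    ]
```

For example, the rules (0, 0, "^Invoice") and (0, 1, "^Amount$") would select only tables whose first row starts with a cell beginning with Invoice followed by a cell that reads exactly Amount.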

Where Columns below match expression

This option groups all the tables where the columns provided below match the Regular Expression given. Note that you can use the ‘+’ button to add more items.

This option also allows you to provide the header rows (see 1.2.3).

Checking the Use Row as Header check box will use the matched row as the header row.

Exclusion criteria

Kingfisher also allows users to exclude tables with characteristics they are not interested in. To use this feature, click the “Show Exclusion Criteria” expander to view the various options.

Exclude by Table ID

This option excludes all the tables whose IDs are provided below.

Exclude cells below match expression

This option excludes all the tables where the cells provided below match the Regular Expression given. Note that you can use the ‘+’ button to add more items.

Exclude Where Columns below match expression

This option excludes all the tables where the columns provided below match the Regular Expression given. Note that you can use the ‘+’ button to add more items.
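
The exclusion criteria can be thought of as filters applied to the list of recognized tables. The Python sketch below illustrates exclusion by table ID and by column expression. It is not Kingfisher code: the function names and parameters are hypothetical, and it assumes that a column matches when any cell in that column matches the pattern and that a table is excluded as soon as one rule matches.

```python
import re


def column_matches(rows, column_index, pattern):
    """Return True if any cell in the given column matches the pattern."""
    return any(
        column_index < len(row) and re.search(pattern, row[column_index]) is not None
        for row in rows
    )


def apply_exclusions(tables, excluded_ids=(), column_rules=()):
    """Return the tables that survive the exclusion criteria.

    tables: dict mapping a table ID to its rows (lists of strings).
    excluded_ids: table IDs to drop outright.
    column_rules: (column_index, pattern) pairs; one match excludes the table.
    """
    kept = {}
    for table_id, rows in tables.items():
        if table_id in excluded_ids:
            continue
        if any(column_matches(rows, c, p) for c, p in column_rules):
            continue
        kept[table_id] = rows
    return kept
```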

Data grid

The main use of the data grid is to show the data that has been extracted from the PDF file. You can also use it to edit headers (if the pen symbol appears next to the header), as shown below. After editing, just click the save job button.