Export to CSV/XLSX

PDF to CSV/XLSX Operation

Screen Field/Button | Description
Output file | csv: Produces a simple csv file. xlsx: Produces an Excel file.
Append Page Data to Existing File | If set to true, instead of overwriting the output file, Document Automation Server (DAS) Content Extraction appends the page contents to the end of the current file.
Append as WorkSheet | When on, this will make DAS Content Extraction add an extra worksheet for each PDF file, instead of appending the data at the end of the first worksheet.
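
The difference between overwriting and appending can be illustrated with a minimal Python sketch. This is not how DAS Content Extraction is implemented; the helper names write_pages and read_rows and the file name output.csv are hypothetical and only show the effect of the setting on a csv file.

```python
import csv
import os
import tempfile

def write_pages(path, rows, append):
    """Write rows to a csv file (hypothetical helper, not a DAS API).

    append=True adds the rows after any existing content;
    append=False replaces the file.
    """
    mode = "a" if append else "w"
    with open(path, mode, newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)

def read_rows(path):
    """Read the whole csv file back as a list of rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))

out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
write_pages(out_path, [["id", "amount"], ["1", "10.00"]], append=False)
write_pages(out_path, [["2", "12.50"]], append=True)   # earlier rows are kept
appended = read_rows(out_path)
write_pages(out_path, [["3", "7.25"]], append=False)   # earlier rows are lost
overwritten = read_rows(out_path)
```

After the second call the file holds all three rows; after the third call it holds only the last row.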

DAS Content Extraction can recognize tables in PDF files with minimal user intervention. The tabular data it extracts is usually written to a csv or xlsx file.

To extract all the tables recognized in the PDF file to a csv file, load the input file and run the job. To tailor the extraction further and group the tables to your liking, read the sections below.

The DAS Content Extraction Table control is divided into two tabs: Document View and Table View.

Document View

This tab shows a visual representation of the PDF file and the tables recognized in it. The tables are usually highlighted. The screenshot below shows the Document View tab.

By default, the Document View tab shows the tables recognized in the first 10 pages. This value can be changed with the Number of Pages to Check property. The tab also provides properties that alter the way tables are recognized. These properties are described below.

Document View tab

Value | Details
Zoom In/Out | Use this control to zoom in or zoom out of the file loaded in the table view control.
Pages | Use this control to move through the pages of the loaded file.
Reload Tables | Use this button to reload the tables in the PDF file. This also allows you to change settings and reload the tables after the file has been processed.
Crop Table | Use this control to specify an area where DAS Content Extraction should check for tables. This method produces more accurate results because it eliminates possible noise.
Do Not Write Header to Output | Use this control to avoid writing the Group Header to the output file. This is important if you want to write similar table data from multiple files to one output file. It also preserves continuity.
Use First Table to Derive All Tables | Use this control to use the first table in the file as a template for recognizing all the other tables in the file.
Use Table Lines to Detect Tables | If set to true, DAS Content Extraction uses the graphic lines in the PDF to find the tables in the file.
Use Word Coordinates to Detect Tables | Use this control to use the coordinates of words on a page to recognize tables.
Extract Only Grouped Tables | Use this control to group tables depending on different rules. Turning this setting on will write only the grouped tables to the csv file.
Number of Pages to Check | Table recognition is CPU intensive, so DAS Content Extraction limits the tables displayed in the GUI to those found in the first 10 pages. You can increase or reduce this number in this textbox.
Minimum Table Gap (pt) | If the space between two consecutive lines is greater than the value given here, that space is interpreted as a table break.
Table Space (pt) | If the space between two words is greater than this value, DAS Content Extraction recognizes that space as a column break. Note: If the value is 0 or less, DAS Content Extraction ignores it and works out the space automatically.
Minimum Number of Rows | DAS Content Extraction will only recognize tables with a higher number of rows than this value.
Minimum Number of Columns | DAS Content Extraction will only recognize tables with a higher number of columns than this value.
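
To make the Table Space (pt) and Minimum Table Gap (pt) thresholds more concrete, here is a simplified Python sketch of gap-based splitting. It is an illustration under stated assumptions, not the recognition algorithm DAS Content Extraction uses: the functions split_columns and split_tables are hypothetical, words are given as (x_start, x_end, text) tuples in points, lines are reduced to a single vertical position, and the automatic mode for values of 0 or less is not modeled.

```python
def split_columns(words, table_space):
    """Group the words of one text line into columns.

    words: list of (x_start, x_end, text) tuples, in points.
    A new column starts when the horizontal gap to the previous
    word is greater than table_space.
    """
    columns = []
    previous_end = None
    for x_start, x_end, text in sorted(words):
        if previous_end is None or x_start - previous_end > table_space:
            columns.append([text])          # gap is wide enough: new column
        else:
            columns[-1].append(text)        # same column as the previous word
        previous_end = x_end
    return [" ".join(parts) for parts in columns]

def split_tables(line_positions, min_table_gap):
    """Split vertical line positions into tables.

    A new table starts when the distance to the previous line
    is greater than min_table_gap.
    """
    tables = []
    for position in sorted(line_positions):
        if tables and position - tables[-1][-1] <= min_table_gap:
            tables[-1].append(position)
        else:
            tables.append([position])
    return tables

line = [(72, 100, "Invoice"), (103, 130, "No."), (200, 240, "Amount")]
columns = split_columns(line, table_space=10)
blocks = split_tables([100, 112, 124, 180, 192], min_table_gap=20)
```

In this example the 3 pt gap keeps "Invoice" and "No." in one column while the 70 pt gap starts a second column, and the 56 pt vertical jump separates the five lines into two tables.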

Table View

The Table View shows the data that was successfully extracted from the PDF file. It is divided into two parts as explained below:

  • The controls on the left are divided into two tabs: Tables and Grouped Tables. They contain a list of tables found and other controls for grouping tables.

  • The right part contains a data grid that displays the data in the selected table or grouped table.

Table View tab

The following section dives deeper into the Table View:

Tables

This tab contains only two controls as mentioned below:

Value | Details
Up/Down Control | Use these buttons to select the previous or next table in the list. Note: The data grid on the right shows the contents of the selected table.
The table list | This contains a list of all the tables recognized in the document. Clicking any one of them loads the selected table in the data grid on the right.

Grouped Tables

By default, DAS Content Extraction can extract all structured data in a PDF file and write it to a .csv or .xlsx file. However, we understand that users might not want to extract all the tables found in a document or might want to group similar tables into one group with a common header. To handle these scenarios, we also provide the ability to group/exclude tables and edit the headers. The following sections will explain this in more depth:

Grouping Tables

Grouping Criteria

This section explains how users can group various tables based on the grouping criteria:

Grouping Criteria

Header Rows: If you write a group of, say, 5 tables to a .csv file, the table header is repeated for each table, so the final csv file is cluttered with duplicate headers. To handle this, DAS Content Extraction skips the first “n” rows, where n is the value provided in the Header Rows control.
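
The effect of Header Rows can be sketched in a few lines of Python. This is only an illustration: merge_tables is a hypothetical function, and the sketch assumes that the header of the first table is kept while the first n rows of every later table are skipped, which the description above does not spell out.

```python
def merge_tables(tables, header_rows):
    """Concatenate tables that share a header (hypothetical helper).

    Assumption: the first table keeps its header rows; for every
    later table the first header_rows rows are skipped.
    """
    if not tables:
        return []
    merged = list(tables[0])
    for table in tables[1:]:
        merged.extend(table[header_rows:])
    return merged

tables = [
    [["Item", "Qty"], ["A", "1"]],
    [["Item", "Qty"], ["B", "2"]],
]
merged = merge_tables(tables, header_rows=1)
```

With header_rows=1 the merged output contains the "Item, Qty" header once, followed by the data rows of both tables.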

Group by Column Count

This criterion groups all the tables with the same number of columns. It also allows you to change the header rows provided (as explained above).
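
As a rough illustration of this criterion, the following Python sketch groups tables by their column count. The function group_by_column_count and the table IDs are hypothetical, and the sketch assumes that the width of a table is the length of its longest row.

```python
from collections import defaultdict

def group_by_column_count(tables):
    """Map each column count to the IDs of the tables that have it.

    tables: dict of table ID -> list of rows (each row a list of cells).
    Assumption: a table's column count is the length of its longest row.
    """
    groups = defaultdict(list)
    for table_id, rows in tables.items():
        width = max((len(row) for row in rows), default=0)
        groups[width].append(table_id)
    return dict(groups)

tables = {
    1: [["a", "b"], ["c", "d"]],
    2: [["x", "y", "z"]],
    3: [["e", "f"]],
}
by_width = group_by_column_count(tables)
```

Here tables 1 and 3 end up in the two-column group and table 2 in the three-column group.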

Where the Rows Below are the Same

This criterion checks the specified row in every table and groups the tables in which that row is identical. The example below groups all the tables with the same first row into one group.

Unfortunately, you can’t save headers for this type of criterion. This is because the table types recognized here are not very predictable and vary across files. We disabled the feature to avoid putting the wrong headers on tables.

Clicking the Use Row as Header checkbox will use the row matched as the header row.
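
Grouping on an identical row can be pictured with the short Python sketch below. It is a simplified model rather than the product's logic: group_by_row is a hypothetical function, rows are compared cell by cell for exact equality, and tables that are too short to contain the chosen row are left ungrouped.

```python
def group_by_row(tables, row_index=0):
    """Group table IDs by the content of one row.

    tables: dict of table ID -> list of rows.
    Tables whose row at row_index is identical share a group.
    """
    groups = {}
    for table_id, rows in tables.items():
        if row_index < len(rows):
            key = tuple(rows[row_index])    # tuples can be used as dict keys
            groups.setdefault(key, []).append(table_id)
    return groups

tables = {
    1: [["Date", "Total"], ["01/02", "5"]],
    2: [["Name", "City"]],
    3: [["Date", "Total"], ["03/04", "9"]],
}
same_first_row = group_by_row(tables, row_index=0)
```

Tables 1 and 3 share the first row "Date, Total" and form one group, while table 2 forms its own.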

Where the Rows Below are the Same

Where the Cells Below are the Same

This criterion works the same way as Where the Rows Below are the Same. The only difference is that it compares a single cell instead of the whole row. Unfortunately, you can’t save headers for this type of criterion. This is because the table types recognized here are not very predictable and vary across files. We disabled the feature to avoid putting the wrong headers on tables.

Clicking the Use Row as Header checkbox will use the row matched as the header row.

Where the Cells Below are the Same

Where Cells Below Match Expression

This criterion groups all the tables where the cells provided below match the given regular expression.

Information

You can use the ‘+’ button to add more items.

This criterion also allows you to change the header rows provided.

Clicking the Use Row as Header checkbox will use the row matched as the header row.
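
A possible reading of this criterion is sketched below in Python. The names matches_all and group_by_cell_expression are hypothetical, cells are addressed by zero-based (row, column) pairs, and two assumptions are made that the description does not confirm: when several items are added with the ‘+’ button, a table must satisfy all of them, and a pattern counts as matching if it is found anywhere in the cell text.

```python
import re

def matches_all(rows, cell_rules):
    """Return True if every ((row, col), pattern) rule matches the table."""
    for (row, col), pattern in cell_rules:
        if row >= len(rows) or col >= len(rows[row]):
            return False                    # the cell does not exist
        if re.search(pattern, rows[row][col]) is None:
            return False
    return True

def group_by_cell_expression(tables, cell_rules):
    """Collect the IDs of all tables that satisfy the cell rules."""
    return [tid for tid, rows in tables.items() if matches_all(rows, cell_rules)]

tables = {
    1: [["Invoice No", "Amount"], ["1001", "10.00"]],
    2: [["Name", "Age"], ["Ann", "41"]],
    3: [["Invoice Number", "Total"], ["1002", "12.50"]],
}
invoice_tables = group_by_cell_expression(tables, [((0, 0), r"^Invoice")])
```

With the single rule on the top-left cell, tables 1 and 3 are grouped and table 2 is left out.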

Where the Cells Below Match Expression

Where Columns Below Match Expression

This criterion groups all the tables where the columns provided below match the given regular expression.

Information

You can use the ‘+’ button to add more items.

This criterion also allows you to change the header rows provided.

Clicking the Use Row as Header checkbox will use the row matched as the header row.

Where Columns Below Match Expression

Exclusion Criteria

DAS Content Extraction also allows you to exclude tables with characteristics you are not interested in. To use this feature, click the Show Exclusion Criteria expander to view the various options:

Show Exclusion Criteria

Exclude by Table ID

This option excludes all the tables whose IDs are provided.

Exclude by Table ID

Exclude Cells Below Match Expression

This option excludes all the tables where the cells provided below match the given regular expression.

Information

You can use the ‘+’ button to add more items.
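
Exclusion by cell expression can be sketched the same way in Python. The function exclude_tables is hypothetical, cells are addressed by zero-based (row, column) pairs, and the sketch assumes that a table is excluded as soon as any one of the added items matches, which is an assumption rather than documented behavior.

```python
import re

def exclude_tables(tables, cell_rules):
    """Return the tables that match none of the exclusion rules.

    tables: dict of table ID -> list of rows.
    cell_rules: list of ((row, col), pattern) pairs.
    """
    kept = {}
    for table_id, rows in tables.items():
        excluded = any(
            row < len(rows)
            and col < len(rows[row])
            and re.search(pattern, rows[row][col]) is not None
            for (row, col), pattern in cell_rules
        )
        if not excluded:
            kept[table_id] = rows
    return kept

tables = {
    1: [["Invoice No", "Amount"]],
    2: [["Page", "Footer"]],
    3: [["Name", "Age"]],
}
remaining = exclude_tables(tables, [((0, 1), r"^Footer$")])
```

Only table 2 has "Footer" in the second cell of its first row, so tables 1 and 3 remain.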

Exclude Cells Below Match Expression

Exclude Where Columns Below Match Expression

This option excludes all the tables where the columns provided below match the given regular expression.

Information

You can use the ‘+’ button to add more items.

Exclude Where Columns Below Match Expression

Data Grid

The main use of the data grid is to display the data that has been extracted from the PDF file. You can also use it to edit the headers by clicking the pencil icon that appears next to the header, as shown below. Once finished, click the Save job button to proceed.

Data Grid