Product Overview
Version 1.2
April 2022
Aquaforest Searchlight Tagger is a tool that further enhances findability and classification of documents in SharePoint by automatically generating and tagging metadata based on the contents of the documents via rules, taxonomies, barcodes, PDF forms, XMP and integration with NLP services.
The Business Problem: Drowning in Data, Thirsting for Information
According to an Enterprise Search and Findability survey conducted by Findwise in 2016, more than one-third of respondents stated that it is difficult for users to find information in their organisations and two-thirds of respondents stated that more than 50% of employees are dependent on good findability in their daily work.
With ever more data being stored in document stores such as Microsoft SharePoint, and with rising expectations of good findability, there is a need for a solution that automatically enriches the raw data by extracting valuable information from it. That information can then be used to enhance findability, a critical need for business success.
The extracted information can be added as metadata (also known as tagging) to the documents in SharePoint. Metadata is key to improving findability and retrieving accurate and relevant information in SharePoint. Documents stored in SharePoint often lack the key metadata required to enable straightforward metadata searches. As a result, when a query is performed, all documents containing the search term are returned, with no possibility of further refining the search results.
Tagging documents with good metadata improves their ranking in search results by prioritising query matches against the metadata (as compared to matches against the text within the documents), thus providing more relevant results. Moreover, the results can be further refined through faceted navigation. With faceted navigation, multiple filters on various additional metadata can be applied incrementally to drill down to get the correct document/information.
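To make the idea of faceted navigation concrete, the following minimal Python sketch applies metadata filters one at a time to a small result set. The documents, the column names (Department, DocType, Year) and the helper functions are invented for illustration; they are not taken from SharePoint or from Tagger.

```python
from collections import Counter

# Hypothetical search hits for the query "contract"; the metadata columns are assumptions.
DOCS = [
    {"title": "Q1 supplier contract", "Department": "Legal", "DocType": "Contract", "Year": 2021},
    {"title": "Q2 supplier contract", "Department": "Legal", "DocType": "Contract", "Year": 2022},
    {"title": "Contract review memo", "Department": "Legal", "DocType": "Memo", "Year": 2022},
    {"title": "Contract budget forecast", "Department": "Finance", "DocType": "Report", "Year": 2022},
]


def refine(docs, column, value):
    """Keep only the documents whose metadata column equals the chosen value."""
    return [d for d in docs if d.get(column) == value]


def facet_counts(docs, column):
    """Count how many documents fall under each value of a metadata column."""
    return Counter(d[column] for d in docs if column in d)


# Apply the filters incrementally, as a user would when drilling down.
results = refine(DOCS, "Department", "Legal")
results = refine(results, "Year", 2022)
results = refine(results, "DocType", "Contract")
titles = [d["title"] for d in results]
```

Without the metadata, all four documents would be returned for the query and could not be narrowed any further; with it, three filters reduce the list to a single document.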
Presently, tagging in most organisations is performed manually. According to the SharePoint and Office 365 State of the Market survey by Concept Searching in 2016, 91% of organisations perform some type of manual tagging. However, only 8.4% were satisfied with their tagging accuracy. This is because it is unrealistic to expect broad sets of employees to accurately tag documents that are often several hundred pages long. Besides, manual tagging is subjective and therefore prone to inconsistencies and ambiguity, and it is also very time-consuming. Inconsistent or, worse, wrong metadata negatively affects search results and eventually the business itself.
All things considered, automated tagging is likely the most practical solution. Automatically generated metadata can be complemented by manual inspections and corrections to improve the consistency, accuracy, speed and cost of metadata tagging.
The Solution: Aquaforest Searchlight Tagger
Aquaforest Searchlight Tagger is a tool that can be configured to automatically extract and/or generate metadata from new and existing documents in SharePoint and tag them accordingly to further enhance findability and classification. It is a stand-alone client application and can be installed on any computer that can connect to the SharePoint server.
Architecture
In a nutshell, Aquaforest Searchlight Tagger works in 3 main steps:
- Documents are downloaded from SharePoint to the temporary location defined in Tagger.
- Metadata are extracted or generated from the documents based on the extraction type(s) selected and the metadata chosen for extraction. The extraction types are described in the sections below.
- The documents are then tagged with the metadata from the previous step. Where required, metadata values not already present in the Term Store are added to it.
The downloaded documents are deleted after processing.
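The download, extract, tag and clean-up flow described above can be sketched in Python as follows. This is an illustration only, not Tagger's implementation: the names process_library, remote_docs, extract and apply_tags are hypothetical, and a plain dictionary stands in for the SharePoint library.

```python
import shutil
import tempfile
from pathlib import Path


def process_library(remote_docs, extract, apply_tags):
    """Run the three steps for every document, then delete the local copies.

    remote_docs: dict of file name -> bytes (a stand-in for a SharePoint library)
    extract:     callable taking a local Path and returning a metadata dict
    apply_tags:  callable taking (file name, metadata dict)
    """
    temp_dir = Path(tempfile.mkdtemp(prefix="tagger_"))
    try:
        for name, content in remote_docs.items():
            local_copy = temp_dir / name
            local_copy.write_bytes(content)   # step 1: download to the temporary location
            metadata = extract(local_copy)    # step 2: extract or generate metadata
            apply_tags(name, metadata)        # step 3: tag the document
    finally:
        # The downloaded documents are deleted after processing, even on failure.
        shutil.rmtree(temp_dir, ignore_errors=True)
    return temp_dir


def word_count(path):
    """Toy extractor used only for this example."""
    return {"WordCount": len(path.read_text().split())}


tagged = {}


def record_tags(name, metadata):
    """Toy tagging step: remember the metadata instead of writing to SharePoint."""
    tagged[name] = metadata


temp_location = process_library({"a.txt": b"one two three"}, word_count, record_tags)
```

Passing the extractor and the tagging step in as functions mirrors the way the extraction types are selected per job, while the download and clean-up logic stays the same.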
Taxonomy Matching
Searchlight supports managed metadata and taxonomies: it can identify the taxonomy values that should be used to tag a document, and it can add new taxonomy values if required. Text is extracted from the documents and compared with the terms in the Taxonomy Term Store to see whether any of them appear in the text. Only the terms in the Term Set defined for the selected SharePoint column are compared.
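The comparison of document text against a Term Set can be illustrated with a short Python sketch. The matching rules used here (case-insensitive, whole words or phrases only) are assumptions made for the example; this overview does not state how Tagger matches terms. The function name match_terms and the sample terms are likewise hypothetical.

```python
import re


def match_terms(text, term_set):
    """Return the terms that appear in the text as whole words or phrases.

    Matching is case-insensitive, and a term is not matched when it is only
    part of a longer word (for example "Contract" inside "contractor").
    """
    found = []
    for term in term_set:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Hypothetical Term Set bound to a "Document Type" column.
term_set = ["Invoice", "Purchase Order", "Contract"]
text = "Please find attached the purchase order and invoice for March."
matched = match_terms(text, term_set)
```

In this example the document would be tagged with "Invoice" and "Purchase Order", while "Contract" is left out because it does not occur in the text.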
Entity Extraction
By integrating with NLP (Natural Language Processing) services, Tagger can assign values for entities such as Location, Person, Company and more. Text is extracted from the documents and passed to the NLP service defined in Tagger. The NLP service then analyses the text and automatically identifies or generates entities to be used as metadata. Entity Extraction is explained in more detail in section 5.1.
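As a rough illustration of how entities returned by an NLP service might be turned into column values, consider the Python sketch below. The response format, the entity-to-column mapping and the column names are all assumptions made for the example; they do not describe any particular NLP service or Tagger's configuration.

```python
from collections import defaultdict

# Assumed mapping from entity type to SharePoint column name (illustrative only).
COLUMN_FOR_ENTITY = {"Location": "Locations", "Person": "People", "Company": "Companies"}


def entities_to_columns(entities, column_for_entity=COLUMN_FOR_ENTITY):
    """Group entity texts by target column, dropping duplicates and unmapped types."""
    columns = defaultdict(list)
    for entity in entities:
        column = column_for_entity.get(entity["type"])
        if column is None:
            continue  # no column configured for this entity type
        if entity["text"] not in columns[column]:
            columns[column].append(entity["text"])
    return dict(columns)


# Hypothetical, simplified response from an NLP service.
nlp_response = [
    {"type": "Person", "text": "Jane Smith"},
    {"type": "Company", "text": "Contoso"},
    {"type": "Location", "text": "London"},
    {"type": "Person", "text": "Jane Smith"},
    {"type": "Date", "text": "April 2022"},
]
column_values = entities_to_columns(nlp_response)
```

The repeated person is stored once, and the Date entity is ignored because no column has been mapped to it in this example.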
Zonal Extraction
Tagger enables zonal extraction of text and barcodes from PDF documents. Over 20 types of barcode can be recognised, and their values assigned to library metadata columns.
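The principle behind zonal extraction is that only content inside a defined rectangle on the page is read. The Python sketch below shows that principle on a list of words with bounding boxes. It assumes a coordinate system with the origin at the top-left corner of the page, and the Box class, the text_in_zone function and the sample coordinates are hypothetical; Tagger's own zone definitions may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Rectangle with the origin at the top-left corner of the page (an assumption)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other):
        """True when the other box lies entirely inside this one."""
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


def text_in_zone(words, zone):
    """Join the words that lie fully inside the zone, top to bottom, then left to right."""
    inside = [(box.top, box.left, text) for text, box in words if zone.contains(box)]
    return " ".join(text for _, _, text in sorted(inside))


# Hypothetical words from one page, each with its bounding box.
words = [
    ("Invoice", Box(50, 40, 110, 55)),
    ("No:", Box(115, 40, 140, 55)),
    ("INV-0042", Box(400, 40, 470, 55)),
    ("Total", Box(50, 700, 90, 715)),
]
invoice_number = text_in_zone(words, Box(380, 30, 500, 60))
label = text_in_zone(words, Box(40, 30, 150, 60))
```

Here the first zone picks up only the invoice number in the top-right corner of the page, which could then be assigned to a metadata column.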
Document Metadata
Both standard and custom PDF metadata can be extracted and assigned to SharePoint columns. This can also include XMP metadata.
PDF Forms
Data from PDF forms can be extracted and each field value assigned to a separate SharePoint column.