Understanding entity extraction and its benefits

Entity extraction is the process of automatically identifying named entities such as people, places, and companies in the unstructured content of documents using Natural Language Processing (NLP).

Say, for example, we have the following text:

US entrepreneur Elon Musk has launched his new rocket, the Falcon Heavy, from the Kennedy Space Center in Florida. The SpaceX CEO said the challenges of developing the new rocket meant the chances of a successful first outing might be only 50-50.

For this experimental and uncertain mission, however, he decided on a much smaller and whimsical payload — his old cherry-red Tesla sports car. A space-suited mannequin was strapped in the driver’s seat, and the radio set to play a David Bowie soundtrack on a loop. The Tesla and its passenger have been dispatched into an elliptical orbit around the Sun that reaches out as far as the Planet Mars.

The Falcon Heavy is essentially three of SpaceX’s workhorse Falcon 9 vehicles strapped together. And, as is the usual practice for SpaceX, all three boost stages — the lower segments of the rocket — returned to Earth to attempt controlled landings. Two came back to touchdown zones on the Florida coast just south of Kennedy. Their landing legs made contact with the ground virtually at the same time.

This is the result of passing it to an NLP service:

NLP entity extraction results showing identified persons like Elon Musk, organizations like SpaceX, locations like Kennedy Space Center, and other entities extracted from the sample text about Falcon Heavy rocket launch

The NLP service automatically identified Person, Organization, Location, and Title entities in the text. If the text had contained other entity types, they would’ve been extracted too. Without NLP, these entities would have to be identified manually, which isn’t feasible for businesses with large numbers of documents.
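
To make the concept concrete, here is a minimal sketch of the same idea using the open-source spaCy library in Python. This is purely illustrative; Tagging itself relies on the external NLP services listed later in this article, not on spaCy.

```python
# Illustrative only: named entity recognition with the open-source spaCy
# library. Tagging uses external NLP services (Rosette, Open Calais, etc.),
# not spaCy; this just shows what entity extraction produces.
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = (
    "US entrepreneur Elon Musk has launched his new rocket, the Falcon Heavy, "
    "from the Kennedy Space Center in Florida. The SpaceX CEO said the "
    "challenges of developing the new rocket meant the chances of a successful "
    "first outing might be only 50-50."
)

doc = nlp(text)
for ent in doc.ents:
    # Typically prints pairs such as "Elon Musk -> PERSON" and "SpaceX -> ORG";
    # the exact labels depend on the model used.
    print(f"{ent.text} -> {ent.label_}")
```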

The benefits of automated entity extraction for businesses are numerous, from making documents easier to find through faceted search (by categorizing documents based on their entities) to unlocking valuable business information that might otherwise remain hidden.

Entity extraction in Tagging

Nutrient Document Searchability Tagging harnesses the power of automated entity extraction by using external third-party NLP service providers. In brief, it extracts the text from each document and sends it to the NLP service for processing. The results are returned to Tagging, where they are processed further and eventually added to SharePoint as metadata. Refer to our entity extraction guide for a diagram of this flow.

In the current version, the following NLP services are supported:

  • Rosette
  • Open Calais
  • Microsoft Cognitive Services
  • Google Natural Language

At the time of writing, all of the above NLP services offer a free tier. However, they come with certain restrictions, as shown below.

| NLP service (free version) | Max API calls | Text limit per call |
| --- | --- | --- |
| Rosette | 10,000 calls per month, 1,000 calls per day | 600KB (50,000 characters) |
| Open Calais | 5,000 calls per day | 100KB |
| Microsoft Cognitive Services | | |
| Google Natural Language | 5,000 calls per month | 1,000 characters |
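
To get a rough feel for what these restrictions mean in practice, the sketch below multiplies each free tier’s daily call allowance by its per-call text limit. The figures are taken from the table above; treat them as approximate and check each provider for current values.

```python
# Rough daily throughput per free tier, using the figures from the table above.
# These numbers are approximate and may change; check each provider's docs.
free_tier = {
    # service: (API calls per day, characters per call)
    "Rosette": (1_000, 50_000),
    "Open Calais": (5_000, 100_000),                 # 100KB treated as ~100,000 characters
    "Google Natural Language": (5_000 / 30, 1_000),  # 5,000 calls per month ≈ 167 per day
}

for service, (calls_per_day, chars_per_call) in free_tier.items():
    chars_per_day = int(calls_per_day * chars_per_call)
    print(f"{service}: ~{chars_per_day:,} characters per day")
```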

Text limit

Since the free version of each NLP service restricts the amount of text it can process in a single call, Tagging splits a document’s contents into chunks of 50,000 characters before sending them to the NLP service. In our tests, this works for most of the currently supported NLP services. However, you can increase this value if you’ve purchased a premium plan. The following setting in Tagging, under Job > Metadata > NLP Settings, controls this:

Text limit setting in Tagging NLP Settings showing configuration field to set maximum character limit per chunk sent to NLP services, defaulted to 50,000 characters
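
Conceptually, the chunking works like the sketch below. This isn’t Tagging’s actual implementation, just a simple illustration of splitting text into fixed-size pieces that match the configured limit.

```python
# Simple illustration of fixed-size chunking; Tagging's internal implementation
# may differ, but the 50,000-character default matches the setting shown above.
def split_into_chunks(text, chunk_size=50_000):
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Example: a 120,000-character document becomes three chunks
# (50,000 + 50,000 + 20,000 characters).
chunks = split_into_chunks("x" * 120_000)
print([len(c) for c in chunks])  # -> [50000, 50000, 20000]
```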

API calls

Every chunk sent to the NLP service consumes one API call. Schedule and limit the number of documents processed so you stay within the limits of the selected NLP service.
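
As a rough guide, the number of API calls a batch of documents will consume can be estimated from the documents’ character counts and the configured chunk size. The helper below is hypothetical, not part of Tagging.

```python
import math

# Hypothetical helper (not part of Tagging): estimate how many API calls a
# batch of documents will consume, given the configured chunk size.
def estimate_api_calls(document_lengths, chunk_size=50_000):
    """document_lengths: character counts, one per document."""
    return sum(math.ceil(length / chunk_size) for length in document_lengths)

# Example: documents of 30,000, 120,000, and 75,000 characters consume
# 1 + 3 + 2 = 6 API calls with the default 50,000-character chunks.
print(estimate_api_calls([30_000, 120_000, 75_000]))  # -> 6
```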

Entities

Each NLP service has its own set of entities that can be extracted. Tagging includes the most common ones for each service by default.

| NLP service | Default entities in Tagging |
| --- | --- |
| Rosette | LOCATION, ORGANIZATION, PERSON, CONCEPTS, KEYPHRASES |
| Open Calais | Country, Company, Person |
| Microsoft Cognitive Services | Keywords |
| Google Natural Language | LOCATION, ORGANIZATION, PERSON |

To view the NLP entities currently defined in Tagging, go to Settings > Enums tab.

NLP entities configuration screen in Tagging Settings Enums tab showing list of available entity types like LOCATION, ORGANIZATION, PERSON for different NLP services

To add new entities:

  1. Select the NLP Service for which you want to add entities.

NLP service selection dropdown showing different entity extraction service options

  2. Click on the Add icon button.
  3. A popup dialog will appear. Enter entity name(s).

Dialog box for adding new entity names with text input field

You can add multiple entities by separating them with commas or new lines. Click the Add button after specifying all the new entity names.

Confirmation dialog showing newly added entity names ready to be saved to the NLP service configuration in Tagging

Now these entities will be available for selection under Job > Metadata > NLP Settings.

NLP Settings dropdown menu showing available entity types that can be selected for extraction during document processing

Another way to add a new entity is to type it directly into the dropdown menu.

Entity selection dropdown with option to type new entity names directly

You can also delete any unused entities. Select the entities you want to delete and click the Delete icon button.

Entity deletion confirmation screen showing selected entities that will be removed from the NLP service configuration

Generating API keys

To extract entities from documents in Tagging, you need to create a free account with the NLP service you wish to use and generate an API key.

  1. Go to Job > Metadata > NLP Settings.
  2. Select the NLP service you want to use.
  3. Click the Don’t have a token icon link next to Token/API Key to access NLP Service.

This will open the registration page for the selected NLP service in your default web browser. Complete the signup process:

  1. Rosette: developer.rosette.com/signup

    Rosette API registration page with signup form fields

  2. Open Calais: opencalais-api

  3. Microsoft Cognitive Services: project-entity-linking

  4. Google Natural Language: console.cloud.google.com/freetrial

  5. Once you’ve received the API key, enter it in the Token/API Key to access NLP Service textbox in Tagging.
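
If you’d like to sanity-check a key outside Tagging first, you can call the NLP service’s REST API directly. The sketch below does this for the Google Natural Language analyzeEntities endpoint (the other services have their own endpoints and request formats); it assumes the Cloud Natural Language API is enabled for your project.

```python
# Standalone sanity check of a Google Natural Language API key against the
# public analyzeEntities REST endpoint. This is independent of Tagging.
# Replace YOUR_API_KEY with your own key.
import requests

API_KEY = "YOUR_API_KEY"
url = f"https://language.googleapis.com/v1/documents:analyzeEntities?key={API_KEY}"

body = {
    "document": {
        "type": "PLAIN_TEXT",
        "content": (
            "US entrepreneur Elon Musk has launched his new rocket "
            "from the Kennedy Space Center in Florida."
        ),
    },
}

response = requests.post(url, json=body)
response.raise_for_status()

for entity in response.json().get("entities", []):
    print(entity["name"], "->", entity["type"])
```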

Entity extraction demo

To test if the API key is valid and working, click on the Demo button under Job > Metadata > NLP Settings.

Demo button location in NLP Settings interface allowing users to test entity extraction functionality with sample documents

Entity extraction demo dialog window with options to select NLP service, enter API key, choose sample file, and run the extraction test

  1. Select the NLP service you want to demo.

    NLP service selection dropdown in demo interface showing available options like Rosette, Open Calais, Microsoft Cognitive Services, and Google Natural Language

  2. Enter the API key to access the selected NLP service. If you don’t have an API key, generate one.

  3. Select a sample file to use for the demo.

  4. Click the Run button and wait for the NLP service to return the results.

Demo results interface showing successful entity extraction with sample file loaded and entities being processed by the selected NLP service

  5. The Formatted Output tab shows the extracted entities after Tagging has formatted them.
    • To view the raw output as returned by the NLP service, click on the Raw Output tab.
    • To view the text (from the document) that was sent to the NLP service, click on the Document Text tab.

Formatted Output tab in demo results displaying extracted entities organized by type (persons, organizations, locations) in a structured format after Tagging processing

Raw Output tab showing the unprocessed JSON response directly returned by the NLP service before Tagging formats the extracted entities

Document Text tab displaying the original text content that was extracted from the document and sent to the NLP service for entity extraction

Entity extraction results showing all extracted entities with their names highlighted in red, demonstrating the various types of entities the NLP service can identify from document content

When using the demo, all entities supported by the NLP service are retrieved. This can be useful if you want to extract entities that aren’t among the defaults provided and don’t know their names.

To view all the entities extracted from the document, go to the Formatted Output tab.

The names of the entities are shown in red in the image. To add them:

  1. Make a note of the ones you want to add.
  2. Close the Demo window.
  3. Go to the Settings > Enums tab.
  4. Add them by following the instructions in the Entities section above.

Running the demo also uses up your API calls, so be careful not to run it too many times. If you’re testing large documents, limit the number of chunks that are processed, because each document will be split into chunks and each chunk consumes one API call.