Understanding entity extraction and its benefits
Entity extraction is the process of automatically extracting named entities such as people, places, and companies from unstructured content in documents using Natural Language Processing (NLP).
Say, for example, we have the following text:
US entrepreneur Elon Musk has launched his new rocket, the Falcon Heavy, from the Kennedy Space Center in Florida. The SpaceX CEO said the challenges of developing the new rocket meant the chances of a successful first outing might be only 50-50. For this experimental and uncertain mission, however, he decided on a much smaller and whimsical payload - his old cherry-red Tesla sports car. A space-suited mannequin was strapped in the driver’s seat, and the radio set to play a David Bowie soundtrack on a loop. The Tesla and its passenger have been despatched into an elliptical orbit around the Sun that reaches out as far as the Planet Mars. The Falcon Heavy is essentially three of SpaceX’s workhorse Falcon 9 vehicles strapped together. And, as is the usual practice for SpaceX, all three boost stages - the lower segments of the rocket - returned to Earth to attempt controlled landings. Two came back to touchdown zones on the Florida coast just south of Kennedy. Their landing legs made contact with the ground virtually at the same time. |
---|
This is the result of passing it to an NLP service:
The NLP service automatically identified Person, Organization, Location and Title entities in the text. If the text had contained other entity types, they would have been extracted too. Without NLP, these entities would have to be identified manually, which is not feasible for the large numbers of documents that businesses handle.
Automated entity extraction offers businesses many benefits, from making documents easier to find through faceted search (by categorising documents based on their entities) to unlocking valuable business-related information that may otherwise be ‘hidden’.
Entity Extraction in Tagging
Nutrient Document Searchability Tagging harnesses automated entity extraction through external third-party NLP service providers. In brief, it first extracts the text from documents and then sends it to the NLP service for processing. The results are sent back to Tagging, where they are processed further and eventually added to SharePoint as metadata. See entity extraction for a diagrammatic representation of this.
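The flow described above (text in, entities out, entities grouped into metadata values) can be sketched in a few lines of Python. This is only an illustration, not how Tagging is implemented: `NlpClient`, `entities_to_metadata` and `fake_nlp_client` are hypothetical names, and the fake client stands in for a real NLP service so that the example runs without an API key.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical type: an NLP client takes text and yields (entity_type, entity_text) pairs.
NlpClient = Callable[[str], Iterable[Tuple[str, str]]]


def entities_to_metadata(text: str, nlp_client: NlpClient) -> Dict[str, List[str]]:
    """Send text to an NLP client and group the returned entities by type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity_type, entity_text in nlp_client(text):
        # Keep each value once per type, preserving first-seen order.
        if entity_text not in grouped[entity_type]:
            grouped[entity_type].append(entity_text)
    return dict(grouped)


def fake_nlp_client(text: str) -> List[Tuple[str, str]]:
    """Stand-in for a real service: looks up a few hard-coded names."""
    known = {
        "Elon Musk": "PERSON",
        "SpaceX": "ORGANIZATION",
        "Florida": "LOCATION",
    }
    return [(etype, name) for name, etype in known.items() if name in text]


sample = "SpaceX CEO Elon Musk launched the Falcon Heavy from Florida."
metadata = entities_to_metadata(sample, fake_nlp_client)
print(metadata)
```

Each key of the resulting dictionary corresponds to one entity type, which is the shape you would expect before values are written to metadata columns.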
In the current version, the following NLP services are supported:
- Rosette
- Open Calais
- Microsoft Cognitive Services
- Google Natural Language
At the time of writing, all of the above NLP services offer a free version of their service. However, the free versions come with certain restrictions, as shown below.
NLP Service (free version) | Max API calls | Text limit per call | Details |
---|---|---|---|
Rosette | 10,000 calls per month, 1,000 calls per day | 600 KB (50,000 characters) | more info |
Open Calais | 5,000 calls per day | 100 KB | |
Microsoft Cognitive Services | | | |
Google Natural Language | 5,000 calls per month | 1,000 characters | more info |
Text Limit
Because the free version of each NLP service restricts the amount of text it can process at any one time, Tagging splits a document’s contents into chunks of 50,000 characters before sending them to the NLP service. In our tests, this works for most of the NLP services currently supported. You can increase this value if you purchase the premium version of a service. The following setting in Tagging under Job > Metadata > NLP Settings controls this:
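As a rough illustration of the chunking described in this section, the sketch below splits text into fixed-size pieces of at most 50,000 characters. It assumes a plain character-count split; Tagging's actual splitting logic is not documented here and may differ, for example by respecting word or sentence boundaries. The function name `split_into_chunks` is hypothetical.

```python
from typing import List

DEFAULT_CHUNK_SIZE = 50_000  # characters, the default chunk size mentioned above


def split_into_chunks(text: str, chunk_size: int = DEFAULT_CHUNK_SIZE) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


document_text = "x" * 120_000  # placeholder for text extracted from a document
chunks = split_into_chunks(document_text)
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```

A 120,000-character document therefore produces three chunks, the last one only partly filled.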
API Calls
Every chunk that is sent to the NLP service consumes one API call. Schedule your jobs and limit the number of documents processed so that you stay within the limit of the selected NLP service.
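To plan a job against a quota, you can estimate the number of API calls from the document lengths. The sketch below assumes one call per chunk and the 50,000-character chunk size described in this article; `estimate_api_calls` is a hypothetical helper, and the daily limit used in the example is the Rosette free-version figure from the table above.

```python
from typing import Iterable

CHUNK_SIZE = 50_000  # characters per chunk, the default described in this article


def estimate_api_calls(document_lengths: Iterable[int], chunk_size: int = CHUNK_SIZE) -> int:
    """Return the number of chunks, and therefore API calls, for the given documents."""
    # Ceiling division: a partly filled final chunk still costs one call.
    return sum((length + chunk_size - 1) // chunk_size for length in document_lengths)


lengths = [120_000, 30_000, 50_000]  # character counts of three example documents
calls_needed = estimate_api_calls(lengths)
daily_limit = 1_000  # Rosette free version, calls per day
print(calls_needed, calls_needed <= daily_limit)  # 5 True
```

Comparing the estimate with the daily or monthly limit before scheduling a job gives an early warning when a batch is too large for the free version.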
Entities
Each NLP service has its own set of entities that it can extract. By default, Tagging includes the most common ones for each service.
NLP Service | Default Entities in Tagging | Additional entity types |
---|---|---|
Rosette | LOCATION, ORGANIZATION, PERSON, CONCEPTS, KEYPHRASES | more info |
Open Calais | Country, Company, Person | additionalcontactdetails, industry, socialtags, topic (more info) |
Microsoft Cognitive Services | Keywords | |
Google Natural Language | LOCATION, ORGANIZATION, PERSON | |
To view the NLP entities currently defined in Tagging, go to Settings > Enums tab.
To add new entities:
1. Select the NLP Service for which you want to add entities.
2. Click on the button.
3. A popup dialog will appear. Enter entity name(s).
You can add multiple entities by separating them with commas or new lines. Click the Add button after specifying all the new entity names.
These entities will now be available for selection under Job > Metadata > NLP Settings.
Alternatively, you can add a new entity by typing it directly into the drop-down menu.
You can also delete any unused entities. Select the entity or entities you want to delete and click on the button.
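If you prepare a list of entity names outside Tagging before pasting it into the dialog, the comma-or-new-line rule described above can be checked with a small script. This is a sketch only: `parse_entity_names` is a hypothetical helper, and trimming whitespace and dropping duplicates are assumptions, not documented behaviour of the dialog.

```python
import re
from typing import List


def parse_entity_names(raw_input: str) -> List[str]:
    """Split input on commas and line breaks into a clean list of names."""
    names: List[str] = []
    for part in re.split(r"[,\r\n]+", raw_input):
        name = part.strip()
        # Assumption: skip empty entries and repeated names.
        if name and name not in names:
            names.append(name)
    return names


print(parse_entity_names("industry, topic\nsocialtags,\n industry"))
# ['industry', 'topic', 'socialtags']
```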
Generating API keys
To extract entities from documents in Tagging, you need to create a free account with the NLP service you wish to use and generate an API key.
1. Go to Job > Metadata > NLP Settings.
2. Select the NLP service you want to use.
3. Click the link next to Token/API Key to access NLP Service.
This will open the registration page for the selected NLP service in your default web browser. Complete the signup process:
a. Rosette: https://developer.rosette.com/signup
b. Open Calais: http://www.opencalais.com/opencalais-api/
c. Microsoft Cognitive Services: https://labs.cognitive.microsoft.com/en-us/project-entity-linking
d. Google Natural Language: https://console.cloud.google.com/freetrial
4. Once you receive the API key, enter it in the Token/API Key to access NLP Service textbox in Tagging.
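To give an idea of what a call made with such a key looks like outside Tagging, the sketch below builds (but does not send) an entity-analysis request for the Google Cloud Natural Language REST API. It assumes the v1 `documents:analyzeEntities` endpoint with the key passed as a `key` query parameter; `build_entity_request` and `YOUR_API_KEY` are placeholders, and nothing here describes how Tagging itself calls the service.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: Google Cloud Natural Language API v1, entity analysis.
ANALYZE_ENTITIES_URL = "https://language.googleapis.com/v1/documents:analyzeEntities"


def build_entity_request(text: str, api_key: str) -> urllib.request.Request:
    """Build a POST request that asks the service to analyse entities in text."""
    body = {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    }
    query = urllib.parse.urlencode({"key": api_key})
    return urllib.request.Request(
        url=f"{ANALYZE_ENTITIES_URL}?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_entity_request("Elon Musk founded SpaceX.", api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)

# Sending the request consumes one API call, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     entities = json.load(response)["entities"]
```

Building the request separately from sending it makes it easy to inspect the URL and body without spending any of your quota.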
Entity Extraction Demo
To quickly test if the API key is valid and working, click on the Demo button under Job > Metadata > NLP Settings.
1. Select the NLP service you want to demo.
2. Enter the API key to access the selected NLP service. If you do not have an API key, generate one.
3. Select a sample file to use for the demo.
4. Click the Run button and wait for the NLP service to return the results.
5. The Formatted Output tab shows the extracted entities after Tagging has formatted them.
- To view the raw output as returned by the NLP service, click on the Raw Output tab.
- To view the text (from the document) that was sent to the NLP service, click on the Document Text tab.
When using the demo, all entities supported by the NLP service are retrieved. This can be useful if you want to extract entities that are not part of the default ones provided and do not know the names of the other entities.
To view all the entities extracted from the document go to the Formatted Output tab.
The names of the entities are shown in red in the image. To add them:
- Make a note of the ones you want to add.
- Close the Demo window.
- Go to Settings > Enum tab.
- Add them by following the instructions for adding new entities in the Entities section above.
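If you copy the entity types from the demo output into a script, finding the ones that are not yet defined in Tagging is a simple set difference. The sketch below uses the Google Natural Language defaults from the table in this article; the entity list is a hand-written sample, not actual service output, and `find_additional_entity_types` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Set

# Default Google Natural Language entities in Tagging, per the table in this article.
DEFAULT_ENTITIES: Set[str] = {"LOCATION", "ORGANIZATION", "PERSON"}


def find_additional_entity_types(
    extracted: Iterable[Dict[str, str]], defaults: Set[str]
) -> List[str]:
    """Return the entity types present in the output but missing from defaults."""
    extra = {entity["type"] for entity in extracted} - defaults
    return sorted(extra)


# Hand-written sample in the shape of a list of {name, type} records.
demo_entities = [
    {"name": "Elon Musk", "type": "PERSON"},
    {"name": "Falcon Heavy", "type": "CONSUMER_GOOD"},
    {"name": "Florida", "type": "LOCATION"},
    {"name": "launch", "type": "EVENT"},
]
print(find_additional_entity_types(demo_entities, DEFAULT_ENTITIES))
# ['CONSUMER_GOOD', 'EVENT']
```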
NOTE: Running the demo also uses up your API calls. Avoid running it more often than necessary, and when testing large documents, limit the number of chunks that are processed: large documents are split into chunks, and each chunk consumes one API call.