Entities
Entities are important expressions, both named (e.g., organizations, cities) and unnamed (e.g., dates). The exact set of supported entities depends on the domain.
Entities have:
name or standard form – disambiguated and standardized form of the entity. For example, we will return USA for both USA and United States. We will also take care of morphology: returning Německo even when the text contains the form Německu. Media API V2 can also display the standard form in a specified language (Germany, Deutschland, Německo, etc.)
id – a unique id of the entity in some knowledge base (we support this only in certain domains)
link to Geneea Knowledge Base if the domain supports it.
type – a string indicating the kind of entity (e.g., person or date); see below for a list of types.
instances or mentions – the actual occurrences of the entity in the document
See the Entity object reference page for more information.
Entity types
The standard media domains support the following entity types:
Basic:
person – John Doe
organization – UNESCO, IBM
location – London, France
product – Skoda Octavia, iPhone 13
event – Brexit, World War II
general – electric vehicle, trade war
Internet:
url – geneea.com
email – info@geneea.com
hashtag – #hashtag
mention – @mention
Date and Time:
Entities can be resolved relative to some point in time (see referenceDate in Request). Standard forms follow the TIMEX3 format.
date – September 3 (XXXX-09-03 when unresolved), next Monday, summer of 2015 (2015-SU)
time – 12:03 (YYYY-MM-DDT12:03), tonight (YYYY-MM-DDTNI)
duration – 3 years and 4 days (P3Y4D), 5 minutes (PT5M). Standard form: P(n)Y(n)M(n)DT(n)H(n)M(n)S
set – set of times/dates: every Monday (XXXX-WXX-1), semiannual (P6M)
Numbers:
number – 3; five (words only in English)
ordinal – third (only for English)
money – $40
percent – 5%
The standard VoC domains support selected named entities, general entities, industry specific entities (e.g. food for restaurants) and Internet/data/numeric entities.
In addition, we can support many other entity types (colors, means of transport, food items, economic terms, laws, product numbers, …) in custom domains.
We detect entities using a combination of machine learning models, rules and lexicons, and we can customize all of these.
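The duration standard form listed under Date and Time, P(n)Y(n)M(n)DT(n)H(n)M(n)S, is easy to take apart on the client side. The following is a minimal sketch, not part of the Geneea SDK: the helper name parse_duration is hypothetical, and it assumes every component is an integer, which holds for the examples above but may not cover every value the API can return.

```python
import re

# Pattern for the duration standard form P(n)Y(n)M(n)DT(n)H(n)M(n)S.
# Every component is optional; this sketch only accepts integer values.
_DURATION_RE = re.compile(
    r'^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?'
    r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$'
)


def parse_duration(std_form):
    """Split a duration standard form into its integer components."""
    match = _DURATION_RE.match(std_form)
    if match is None:
        raise ValueError(f'not a duration standard form: {std_form!r}')
    parts = {
        name: int(value)
        for name, value in match.groupdict().items()
        if value is not None
    }
    if not parts:
        # Strings such as 'P' or 'PT' match the pattern but carry no value.
        raise ValueError(f'empty duration: {std_form!r}')
    return parts


print(parse_duration('P3Y4D'))  # {'years': 3, 'days': 4}
print(parse_duration('PT5M'))   # {'minutes': 5}
print(parse_duration('P6M'))    # {'months': 6}
```

Note that M means months before the T separator and minutes after it, which is why the pattern keeps the date and time parts apart.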
Sample call
You can easily try it yourself:
curl -X POST https://api.geneea.com/v3/analysis \
-H 'Authorization: user_key <YOUR USER KEY>' \
-H 'Content-Type: application/json' \
-d '{
"id": "1",
"text": "The trip to London last summer was great. I also liked Cambridge a lot. ",
"referenceDate": "2016-02-01",
"analyses": ["entities"]
}'
# On Windows, use \" instead of " and " instead of '
# Python, using the geneeanlpclient SDK
from geneeanlpclient import g3

requestBuilder = g3.Request.Builder(analyses=[g3.AnalysisType.ENTITIES])
with g3.Client.create(userKey='<YOUR USER KEY>') as analyzer:
    result = analyzer.analyze(requestBuilder.build(
        id=str(1),
        referenceDate='2016-02-01',
        text='The trip to London last summer was great. I also liked Cambridge a lot.'
    ))
    # Print the standard form and type of each detected entity
    for e in result.entities:
        print(f'{e.stdForm}: {e.type}')
# Python, calling the REST API directly with requests
import requests

def callGeneea(input):
    url = 'https://api.geneea.com/v3/analysis'
    headers = {
        'content-type': 'application/json',
        'Authorization': 'user_key <YOUR USER KEY>'
    }
    return requests.post(url, json=input, headers=headers).json()

responseObj = callGeneea({
    'id': '1',
    'text': 'The trip to London last summer was great. I also liked Cambridge a lot. ',
    'referenceDate': '2016-02-01',
    'analyses': ['entities']
})
print(responseObj)
You should get the following response:
{
"id": "1",
"language": {"detected": "en"},
"entities": [
{"id": "E0", "stdForm": "2015-SU", "type": "date"},
{"id": "E1", "stdForm": "London", "type": "location"},
{"id": "E2", "stdForm": "Cambridge", "type": "location"}
],
"usedChars": 100
}
The Python SDK example prints:
2015-SU: date
London: location
Cambridge: location
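Once the response is parsed into a dictionary (as in the requests example), the entities can be grouped by type. The following is a minimal sketch: the helper name entities_by_type is hypothetical, and the response literal is copied from the sample response above rather than fetched from the API.

```python
from collections import defaultdict

# Sample response copied from the documentation above.
response = {
    'id': '1',
    'language': {'detected': 'en'},
    'entities': [
        {'id': 'E0', 'stdForm': '2015-SU', 'type': 'date'},
        {'id': 'E1', 'stdForm': 'London', 'type': 'location'},
        {'id': 'E2', 'stdForm': 'Cambridge', 'type': 'location'},
    ],
    'usedChars': 100,
}


def entities_by_type(response_obj):
    """Map each entity type to the standard forms found in the response."""
    grouped = defaultdict(list)
    for entity in response_obj.get('entities', []):
        grouped[entity['type']].append(entity['stdForm'])
    return dict(grouped)


print(entities_by_type(response))
# {'date': ['2015-SU'], 'location': ['London', 'Cambridge']}
```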
Mentions and highlighting
You can use "returnMentions": "true" to return the entity mentions:
curl -X POST https://api.geneea.com/v3/analysis \
-H 'Authorization: user_key <YOUR USER KEY>' \
-H 'Content-Type: application/json' \
-d '{
"id": "1",
"text": "The trip to London last summer was great. I also liked Cambridge a lot. ",
"referenceDate": "2016-02-01",
"analyses": ["entities"],
"returnMentions": "true"
}'
# On Windows, use \" instead of " and " instead of '
# Python, using the geneeanlpclient SDK
from geneeanlpclient import g3

requestBuilder = g3.Request.Builder(analyses=[g3.AnalysisType.ENTITIES], returnMentions=True)
with g3.Client.create() as analyzer:
    result = analyzer.analyze(requestBuilder.build(
        id=str(1),
        referenceDate='2016-02-01',
        text='The trip to London last summer was great. I also liked Cambridge a lot.'
    ))
    for e in result.entities:
        print(f'{e.stdForm}: {e.type}')
        for m in e.mentions:
            # charSpan can be used for highlighting in the original text
            print(f'\t{m.text}; {m.mwl}; {m.tokens.charSpan}')
# Python, calling the REST API directly with requests
import requests

def callGeneea(input):
    url = 'https://api.geneea.com/v3/analysis'
    headers = {
        'content-type': 'application/json',
        'Authorization': 'user_key <YOUR USER KEY>'
    }
    return requests.post(url, json=input, headers=headers).json()

responseObj = callGeneea({
    'id': '1',
    'text': 'The trip to London last summer was great. I also liked Cambridge a lot. ',
    'referenceDate': '2016-02-01',
    'analyses': ['entities'],
    'returnMentions': True
})
print(responseObj)
Compared with the previous response, this one contains the mentions of the individual entities: their text and references to the relevant tokens. The text itself, split into paragraphs, sentences and tokens, is added to the response automatically.
{
"id": "1",
"language": {"detected": "en"},
"paragraphs": [{
"id": "P2",
"type": "BODY",
"text": "The trip to London last summer was great. I also liked Cambridge a lot. ",
"corrText": "The trip to London last summer was great. I also liked Cambridge a lot. ",
"sentences": [{
"id": "s0",
"tokens": [
{"id": "t0", "off": 0, "text": "The", "corrOff": 0, "corrText": "The"},
{"id": "t1", "off": 4, "text": "trip", "corrOff": 4, "corrText": "trip"},
{"id": "t2", "off": 9, "text": "to", "corrOff": 9, "corrText": "to"},
{"id": "t3", "off": 12, "text": "London", "corrOff": 12, "corrText": "London"},
{"id": "t4", "off": 19, "text": "last", "corrOff": 19, "corrText": "last"},
{"id": "t5", "off": 24, "text": "summer", "corrOff": 24, "corrText": "summer"},
{"id": "t6", "off": 31, "text": "was", "corrOff": 31, "corrText": "was"},
{"id": "t7", "off": 35, "text": "great", "corrOff": 35, "corrText": "great"},
{"id": "t8", "off": 40, "text": ".", "corrOff": 40, "corrText": "."}]
}, {
"id": "s1",
"tokens": [
{"id": "t9", "off": 42, "text": "I", "corrOff": 42, "corrText": "I"},
{"id": "t10", "off": 44, "text": "also", "corrOff": 44, "corrText": "also"},
{"id": "t11", "off": 49, "text": "liked", "corrOff": 49, "corrText": "liked"},
{"id": "t12", "off": 55, "text": "Cambridge", "corrOff": 55, "corrText": "Cambridge"},
{"id": "t13", "off": 65, "text": "a", "corrOff": 65, "corrText": "a"},
{"id": "t14", "off": 67, "text": "lot", "corrOff": 67, "corrText": "lot"},
{"id": "t15", "off": 70, "text": ".", "corrOff": 70, "corrText": "."}
]
}]
}],
"entities": [
{"id": "E0", "stdForm": "2015-SU", "type": "date", "mentions": [{"id": "m0", "mwl": "last summer", "text": "last summer", "tokenIds": ["t4", "t5"]}]},
{"id": "E1", "stdForm": "London", "type": "location", "mentions": [{"id": "m1", "mwl": "London", "text": "London", "tokenIds": ["t3"]}]},
{"id": "E2", "stdForm": "Cambridge", "type": "location", "mentions": [{"id": "m2", "mwl": "Cambridge", "text": "Cambridge", "tokenIds": ["t12"]}]}
],
"usedChars": 100
}
The Python SDK example prints:
2015-SU: date
    last summer; last summer; CharSpan(start=19, end=30)
London: location
    London; London; CharSpan(start=12, end=18)
Cambridge: location
    Cambridge; Cambridge; CharSpan(start=55, end=64)
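When working with the raw JSON rather than the SDK, the character spans for highlighting can be derived from each mention's tokenIds and the token offsets. The following is a minimal sketch under several assumptions: the helper names mention_spans and highlight are hypothetical, the off values are taken to be character offsets into the submitted text, and mentions are assumed not to overlap. The response literal is a trimmed copy of the sample above that keeps only the tokens the mentions refer to.

```python
text = 'The trip to London last summer was great. I also liked Cambridge a lot. '

# Trimmed copy of the sample response: only the referenced tokens are kept.
response = {
    'paragraphs': [{
        'id': 'P2',
        'sentences': [
            {'id': 's0', 'tokens': [
                {'id': 't3', 'off': 12, 'text': 'London'},
                {'id': 't4', 'off': 19, 'text': 'last'},
                {'id': 't5', 'off': 24, 'text': 'summer'},
            ]},
            {'id': 's1', 'tokens': [
                {'id': 't12', 'off': 55, 'text': 'Cambridge'},
            ]},
        ],
    }],
    'entities': [
        {'id': 'E0', 'stdForm': '2015-SU', 'type': 'date',
         'mentions': [{'id': 'm0', 'text': 'last summer', 'tokenIds': ['t4', 't5']}]},
        {'id': 'E1', 'stdForm': 'London', 'type': 'location',
         'mentions': [{'id': 'm1', 'text': 'London', 'tokenIds': ['t3']}]},
        {'id': 'E2', 'stdForm': 'Cambridge', 'type': 'location',
         'mentions': [{'id': 'm2', 'text': 'Cambridge', 'tokenIds': ['t12']}]},
    ],
}


def mention_spans(response_obj):
    """Return sorted (start, end, entity type) tuples, one per mention."""
    # Index all tokens by id so mentions can look them up.
    tokens = {}
    for para in response_obj.get('paragraphs', []):
        for sent in para.get('sentences', []):
            for tok in sent.get('tokens', []):
                tokens[tok['id']] = tok
    spans = []
    for entity in response_obj.get('entities', []):
        for mention in entity.get('mentions', []):
            toks = [tokens[tid] for tid in mention['tokenIds']]
            start = min(t['off'] for t in toks)
            end = max(t['off'] + len(t['text']) for t in toks)
            spans.append((start, end, entity['type']))
    return sorted(spans)


def highlight(source, spans):
    """Wrap each span in [text|type] markers; spans must not overlap."""
    out, pos = [], 0
    for start, end, etype in spans:
        out.append(source[pos:start])
        out.append(f'[{source[start:end]}|{etype}]')
        pos = end
    out.append(source[pos:])
    return ''.join(out)


print(highlight(text, mention_spans(response)))
```

The computed spans agree with the CharSpan values printed by the SDK example, for instance (19, 30) for "last summer". For text outside plain ASCII, it would be worth checking how the API counts offsets before relying on Python string indices.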