Entities

Entities are important expressions, both named (e.g., organizations, cities) and unnamed (e.g., dates). The exact set of supported entities depends on the domain.

Entities have:

  • name or standard form - the disambiguated and standardized form of the entity. For example, we return USA for both USA and United States. We also handle morphology, returning Německo even when the text contains the inflected form Německu. Media API V2 can also display the standard form in a specified language (Germany, Deutschland, Německo, etc.)
  • id - a unique id of the entity in some knowledge base (we support this only in certain domains)
  • link to the Geneea Knowledge Base, if the domain supports it
  • type - a string indicating the kind of entity (e.g., person or date); see below for a list of types
  • instances or mentions - the actual occurrences of the entity in the document

See the Entity object reference page for more information.

Entity types

The standard media domains support the following entity types:

  • Basic:

    • person - John Doe
    • organization - UNESCO, IBM
    • location - London, France
    • product - Skoda Octavia, iPhone 13
    • event - Brexit, World War II
    • general - electric vehicle, trade war
  • Internet:

    • url - geneea.com
    • email - info@geneea.com
    • hashtag - #hashtag
    • mention - @mention
  • Date and Time:

    Entities can be resolved relative to some point in time (see referenceDate in Request). Standard forms follow the TIMEX3 format.

    • date - September 3 (XXXX-09-03 when unresolved), next Monday, summer of 2015 (2015-SU)
    • time - 12:03 (YYYY-MM-DDT12:03), tonight (YYYY-MM-DDTNI)
    • duration - 3 years and 4 days (P3Y4D), 5 minutes (PT5M). Standard form P(n)Y(n)M(n)DT(n)H(n)M(n)S
    • set - set of times/dates - every Monday (XXXX-WXX-1), semiannual (P6M)
  • Numbers:

    • number - 3, five (number words are recognized only in English)
    • ordinal - third (English only)
    • money - $40
    • percent - 5%
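
The duration standard form listed above, P(n)Y(n)M(n)DT(n)H(n)M(n)S, can be unpacked with a small regular expression. The following is a minimal sketch, not part of the API: parse_duration is a hypothetical helper name, and it assumes integer components only, so week-based or underspecified forms are rejected.

```python
import re
from typing import Dict

# Pattern for P(n)Y(n)M(n)DT(n)H(n)M(n)S. Every component is optional,
# but a "T" must be followed by at least one time component.
_DURATION_RE = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?=\d)(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(std_form: str) -> Dict[str, int]:
    """Split a duration standard form into its non-empty components."""
    match = _DURATION_RE.match(std_form)
    if match is None:
        raise ValueError(f"not a supported duration form: {std_form!r}")
    parts = {
        name: int(value)
        for name, value in match.groupdict().items()
        if value is not None
    }
    if not parts:
        # A bare "P" carries no information.
        raise ValueError(f"empty duration: {std_form!r}")
    return parts


print(parse_duration("P3Y4D"))  # {'years': 3, 'days': 4}
print(parse_duration("PT5M"))   # {'minutes': 5}
```

Note that M means months before the T separator and minutes after it, which is why P6M and PT5M parse differently.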

The standard VoC domains support selected named entities, general entities, industry-specific entities (e.g., food for restaurants) and Internet, date and numeric entities.

In addition, we can support many other entity types (colors, means of transport, food items, economic terms, laws, product numbers, ...) in custom domains.

Entities are detected using a combination of machine learning models, rules and lexicons, all of which can be customized.

Sample call

You can easily try it yourself:

curl -X POST https://api.geneea.com/v3/analysis \
-H 'Authorization: user_key <YOUR USER KEY>' \
-H 'Content-Type: application/json' \
-d '{
"id": "1",
"text": "The trip to London last summer was great. I also liked Cambridge a lot. ",
"referenceDate": "2016-02-01",
"analyses": ["entities"]
}'

## On Windows, use \" instead of " and " instead of '

You should get the following response:

{
"id": "1",
"language": {"detected": "en"},
"entities": [
{"id": "E0", "stdForm": "2015-SU", "type": "date"},
{"id": "E1", "stdForm": "London", "type": "location"},
{"id": "E2", "stdForm": "Cambridge", "type": "location"}
],
"usedChars": 100
}
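
The same call can be made from Python. The sketch below uses only the standard library; the URL, header and field names are copied from the curl call above, while build_request, analyze and entities_by_type are hypothetical helper names. To stay runnable without a user key, the example processes the sample response shown above instead of calling the service.

```python
import json
import urllib.request
from collections import defaultdict
from typing import Any, Dict, List, Optional

API_URL = "https://api.geneea.com/v3/analysis"  # endpoint from the curl call


def build_request(text: str, reference_date: Optional[str] = None,
                  doc_id: str = "1") -> Dict[str, Any]:
    """Build the JSON payload for an entities-only analysis."""
    payload: Dict[str, Any] = {"id": doc_id, "text": text, "analyses": ["entities"]}
    if reference_date is not None:
        payload["referenceDate"] = reference_date
    return payload


def analyze(payload: Dict[str, Any], user_key: str) -> Dict[str, Any]:
    """POST the payload and return the decoded JSON response (needs a valid key)."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"user_key {user_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def entities_by_type(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group the standard forms of the returned entities by entity type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in response.get("entities", []):
        grouped[entity["type"]].append(entity["stdForm"])
    return dict(grouped)


# The sample response from above, used here in place of a live call.
SAMPLE_RESPONSE = {
    "id": "1",
    "language": {"detected": "en"},
    "entities": [
        {"id": "E0", "stdForm": "2015-SU", "type": "date"},
        {"id": "E1", "stdForm": "London", "type": "location"},
        {"id": "E2", "stdForm": "Cambridge", "type": "location"},
    ],
    "usedChars": 100,
}

print(entities_by_type(SAMPLE_RESPONSE))
# {'date': ['2015-SU'], 'location': ['London', 'Cambridge']}
```

With a real key, analyze(build_request(text, "2016-02-01"), user_key) would replace SAMPLE_RESPONSE.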

Mentions and highlighting

Add "returnMentions": "true" to the request to have the entity mentions returned:

curl -X POST https://api.geneea.com/v3/analysis \
-H 'Authorization: user_key <YOUR USER KEY>' \
-H 'Content-Type: application/json' \
-d '{
"id": "1",
"text": "The trip to London last summer was great. I also liked Cambridge a lot. ",
"referenceDate": "2016-02-01",
"analyses": ["entities"],
"returnMentions": "true"
}'

## On Windows, use \" instead of " and " instead of '

Compared with the previous response, this one contains the mentions of the individual entities: their text and references to the relevant tokens. The text, split into paragraphs, sentences and tokens, is added to the response automatically.

{
"id": "1",
"language": {"detected": "en"},
"paragraphs": [{
"id": "P2",
"type": "BODY",
"text": "The trip to London last summer was great. I also liked Cambridge a lot. ",
"corrText": "The trip to London last summer was great. I also liked Cambridge a lot. ",
"sentences": [{
"id": "s0",
"tokens": [
{"id": "t0", "off": 0, "text": "The", "corrOff": 0, "corrText": "The"},
{"id": "t1", "off": 4, "text": "trip", "corrOff": 4, "corrText": "trip"},
{"id": "t2", "off": 9, "text": "to", "corrOff": 9, "corrText": "to"},
{"id": "t3", "off": 12, "text": "London", "corrOff": 12, "corrText": "London"},
{"id": "t4", "off": 19, "text": "last", "corrOff": 19, "corrText": "last"},
{"id": "t5", "off": 24, "text": "summer", "corrOff": 24, "corrText": "summer"},
{"id": "t6", "off": 31, "text": "was", "corrOff": 31, "corrText": "was"},
{"id": "t7", "off": 35, "text": "great", "corrOff": 35, "corrText": "great"},
{"id": "t8", "off": 40, "text": ".", "corrOff": 40, "corrText": "."}]
}, {
"id": "s1",
"tokens": [
{"id": "t9", "off": 42, "text": "I", "corrOff": 42, "corrText": "I"},
{"id": "t10", "off": 44, "text": "also", "corrOff": 44, "corrText": "also"},
{"id": "t11", "off": 49, "text": "liked", "corrOff": 49, "corrText": "liked"},
{"id": "t12", "off": 55, "text": "Cambridge", "corrOff": 55, "corrText": "Cambridge"},
{"id": "t13", "off": 65, "text": "a", "corrOff": 65, "corrText": "a"},
{"id": "t14", "off": 67, "text": "lot", "corrOff": 67, "corrText": "lot"},
{"id": "t15", "off": 70, "text": ".", "corrOff": 70, "corrText": "."}
]
}]
}],
"entities": [
{"id": "E0", "stdForm": "2015-SU", "type": "date", "mentions": [{"id": "m0", "mwl": "last summer", "text": "last summer", "tokenIds": ["t4", "t5"]}]},
{"id": "E1", "stdForm": "London", "type": "location", "mentions": [{"id": "m1", "mwl": "London", "text": "London", "tokenIds": ["t3"]}]},
{"id": "E2", "stdForm": "Cambridge", "type": "location", "mentions": [{"id": "m2", "mwl": "Cambridge", "text": "Cambridge", "tokenIds": ["t12"]}]}
],
"usedChars": 100
}
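
One way to use the mentions for highlighting is to map each mention's tokenIds back to the token offsets. The sketch below assumes that off is a zero-based character offset into the text, that the tokens of a mention are contiguous, and that mentions do not overlap. mention_spans and highlight are hypothetical helper names, not part of the API, and the response is trimmed to the tokens that the mentions reference.

```python
from typing import Any, Dict, List, Tuple

Span = Tuple[str, int, int]  # (entity type, start offset, end offset)


def mention_spans(response: Dict[str, Any]) -> List[Span]:
    """Return one (type, start, end) span per entity mention, sorted by start."""
    # Index every token in the response by its id.
    tokens = {
        token["id"]: token
        for paragraph in response.get("paragraphs", [])
        for sentence in paragraph.get("sentences", [])
        for token in sentence.get("tokens", [])
    }
    spans: List[Span] = []
    for entity in response.get("entities", []):
        for mention in entity.get("mentions", []):
            first = tokens[mention["tokenIds"][0]]
            last = tokens[mention["tokenIds"][-1]]
            # Assumption: the mention runs from its first to its last token.
            spans.append((entity["type"], first["off"], last["off"] + len(last["text"])))
    return sorted(spans, key=lambda span: span[1])


def highlight(text: str, spans: List[Span]) -> str:
    """Wrap each span as [text|type]; works backwards so offsets stay valid."""
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = f"{text[:start]}[{text[start:end]}|{entity_type}]{text[end:]}"
    return text


TEXT = "The trip to London last summer was great. I also liked Cambridge a lot. "

# Trimmed version of the response above: only tokens referenced by mentions.
RESPONSE = {
    "paragraphs": [{
        "text": TEXT,
        "sentences": [
            {"id": "s0", "tokens": [
                {"id": "t3", "off": 12, "text": "London"},
                {"id": "t4", "off": 19, "text": "last"},
                {"id": "t5", "off": 24, "text": "summer"},
            ]},
            {"id": "s1", "tokens": [
                {"id": "t12", "off": 55, "text": "Cambridge"},
            ]},
        ],
    }],
    "entities": [
        {"id": "E0", "stdForm": "2015-SU", "type": "date",
         "mentions": [{"id": "m0", "text": "last summer", "tokenIds": ["t4", "t5"]}]},
        {"id": "E1", "stdForm": "London", "type": "location",
         "mentions": [{"id": "m1", "text": "London", "tokenIds": ["t3"]}]},
        {"id": "E2", "stdForm": "Cambridge", "type": "location",
         "mentions": [{"id": "m2", "text": "Cambridge", "tokenIds": ["t12"]}]},
    ],
}

print(highlight(TEXT, mention_spans(RESPONSE)))
```

Replacing markers from the end of the text towards the start keeps the earlier offsets valid without any bookkeeping.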