Entities

Entities are important expressions, both named (e.g., organizations, cities) and unnamed (e.g., dates). The exact set of supported entities depends on the domain.

Entities have:

  • name or standard form – the disambiguated and standardized form of the entity. For example, we return USA for both USA and United States. We also handle morphology: we return Německo even when the text contains the inflected form Německu.

  • id – a unique id of the entity in some knowledge base (we support this only in certain domains)

  • link to Geneea Knowledge Base if the domain supports it.

  • type – a string indicating the kind of entity (e.g., person or date); see below for the list of types.

  • instances or mentions – the actual occurrences of the entity in the document

See the Entity object reference page for more information.

Entity types

The standard public workflows support the following entity types:

  • Basic:

    • location – London, France

    • organization – UNESCO, IBM

    • person – John Doe

  • Internet:

    • url – www.geneea.com

    • email – info@geneea.com

    • hashtag – #hashtag

    • mention – @mention

  • Relations and phrases:

    • verb relation (action + objects) – VERB, e.g. buy lunch

    • attribute relation (attribute + noun) – ATTR, e.g. denied credit card

      Note that for relations the text field of each instance contains the structure of the entity, e.g. CLAUSE:attempt(AMOD:First). The format is fnc:lemma(fnc:lemma, ...), where fnc can be any dependency label from Universal Dependencies V1, mainly DOBJ (direct object), AMOD (attribute), CS_REFL_CLITIC (reflexive). For most purposes, you can ignore the first function, which expresses the function of the whole phrase relative to the rest of the sentence.

  • Date and Time:

    Entities can be resolved relative to some point in time (see referenceDate in Request). Standard forms follow the TIMEX3 format.

    • date – September 3 (XXXX-09-03 when unresolved), next Monday, summer of 2015 (2015-SU)

    • time – 12:03 (YYYY-MM-DDT12:03), tonight (YYYY-MM-DDTNI)

    • duration – 3 years and 4 days (P3Y4D), 5 minutes (PT5M). The standard form is P(n)Y(n)M(n)DT(n)H(n)M(n)S.

    • set – set of times/dates – every Monday (XXXX-WXX-1), semiannual (P6M)

  • Numbers:

    • number – 3; five (words only in English)

    • ordinal – third (only for English)

    • money – $40

    • percent – 5%
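The duration standard form listed under Date and Time, P(n)Y(n)M(n)DT(n)H(n)M(n)S, can be split into its components with a regular expression. The following is only a minimal sketch and not part of the Geneea API: the name parse_duration is hypothetical, and the pattern assumes that every component is a plain integer, so other TIMEX3 variants (for example weeks or underspecified values) are not handled.

```python
import re

# Pattern for P(n)Y(n)M(n)DT(n)H(n)M(n)S. Every component is optional,
# and the time part is introduced by "T".
DURATION_RE = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(std_form):
    """Return the components present in std_form as a dict, or None.

    Hypothetical helper: handles integer components only.
    """
    match = DURATION_RE.match(std_form)
    if match is None:
        return None
    parts = {
        name: int(value)
        for name, value in match.groupdict().items()
        if value is not None
    }
    # "P" or "PT" alone carries no information, so treat it as unparsed.
    return parts or None


print(parse_duration("P3Y4D"))  # {'years': 3, 'days': 4}
print(parse_duration("PT5M"))   # {'minutes': 5}
```

The "T" separator is what distinguishes months (P6M) from minutes (PT6M), which is why the pattern keeps the date and time parts in separate groups.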

In addition, we support many other entity types (food items, colors, means of transport, economic terms, laws, product numbers, …) in custom workflows aimed at particular industries.

To find the entities in a document, we use a combination of machine learning models, rules and lexicons. As always, we can customize all of these.
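The relation structure described in the note above, fnc:lemma(fnc:lemma, ...), can be turned into a small tree for further processing. The sketch below is an illustration rather than an official client: RelationNode, split_top_level and parse_relation are hypothetical names, and the code assumes that lemmas contain no colons, commas or parentheses. Nested arguments are accepted as a precaution, although the description above only shows one level.

```python
from typing import List, NamedTuple


class RelationNode(NamedTuple):
    fnc: str                     # dependency label, e.g. AMOD
    lemma: str                   # lemma of the word
    args: List["RelationNode"]   # arguments listed in parentheses


def split_top_level(s):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return [p.strip() for p in parts if p.strip()]


def parse_relation(text):
    """Parse 'fnc:lemma(fnc:lemma, ...)' into a RelationNode tree."""
    head, paren, rest = text.strip().partition("(")
    fnc, sep, lemma = head.partition(":")
    if not sep:
        raise ValueError("expected 'fnc:lemma' in %r" % head)
    args = []
    if paren:
        if not rest.endswith(")"):
            raise ValueError("unbalanced parentheses in %r" % text)
        args = [parse_relation(a) for a in split_top_level(rest[:-1])]
    return RelationNode(fnc.strip(), lemma.strip(), args)


node = parse_relation("CLAUSE:attempt(AMOD:First)")
# The first function (CLAUSE) can usually be ignored, as noted above.
print(node.lemma, [(a.fnc, a.lemma) for a in node.args])
# attempt [('AMOD', 'First')]
```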

Sample call

You can easily try it yourself:

curl -X POST https://api.geneea.com/v3/analysis \
-H 'Authorization: user_key <YOUR USER KEY>' \
-H 'Content-Type: application/json' \
-d '{
    "id": "1",
    "text": "The trip to London last summer was great. I also liked Cambridge a lot. ",
    "referenceDate": "2016-02-01",
    "analyses": ["entities"]
}'

# On Windows, use \" instead of " and " instead of '

# The same call in Python (requires the third-party requests package)
import requests

def callGeneea(payload):
    url = 'https://api.geneea.com/v3/analysis'
    headers = {
        'content-type': 'application/json',
        'Authorization': 'user_key <your user key>'
    }

    # json= serializes the dict and sends it as the JSON request body
    return requests.post(url, json=payload, headers=headers).json()

responseObj = callGeneea({
    'id': '1',
    'text': 'The trip to London last summer was great. I also liked Cambridge a lot. ',
    'referenceDate': '2016-02-01',
    'analyses': ['entities']
})

print(responseObj)

You should get the following response:

{
    "id": "1",
    "language": {"detected": "en"},
    "entities": [
        {"id": "E0", "stdForm": "2015-SU", "type": "date"},
        {"id": "E1", "stdForm": "London", "type": "location"},
        {"id": "E2", "stdForm": "Cambridge", "type": "location"}
    ],
    "usedChars": 100
}
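Once the response is parsed, the entities can be processed like any other list of dictionaries. As a small illustration, the sketch below groups the standard forms by entity type. The function name group_entities_by_type is hypothetical, and the response dictionary is simply a copy of the sample above, so no API call is made.

```python
from collections import defaultdict


def group_entities_by_type(response):
    """Map each entity type to the list of its standard forms."""
    grouped = defaultdict(list)
    for entity in response.get("entities", []):
        grouped[entity["type"]].append(entity["stdForm"])
    return dict(grouped)


# Copy of the sample response shown above.
response = {
    "id": "1",
    "language": {"detected": "en"},
    "entities": [
        {"id": "E0", "stdForm": "2015-SU", "type": "date"},
        {"id": "E1", "stdForm": "London", "type": "location"},
        {"id": "E2", "stdForm": "Cambridge", "type": "location"},
    ],
    "usedChars": 100,
}

print(group_entities_by_type(response))
# {'date': ['2015-SU'], 'location': ['London', 'Cambridge']}
```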

You can use "returnMentions": "true" to return the entity mentions:

curl -X POST https://api.geneea.com/v3/analysis \
-H 'Authorization: user_key <YOUR USER KEY>' \
-H 'Content-Type: application/json' \
-d '{
    "id": "1",
    "text": "The trip to London last summer was great. I also liked Cambridge a lot. ",
    "referenceDate": "2016-02-01",
    "analyses": ["entities"],
    "returnMentions": "true"
}'

# On Windows, use \" instead of " and " instead of '

# The same call in Python (requires the third-party requests package)
import requests

def callGeneea(payload):
    url = 'https://api.geneea.com/v3/analysis'
    headers = {
        'content-type': 'application/json',
        'Authorization': 'user_key <your user key>'
    }

    # json= serializes the dict and sends it as the JSON request body
    return requests.post(url, json=payload, headers=headers).json()

responseObj = callGeneea({
    'id': '1',
    'text': 'The trip to London last summer was great. I also liked Cambridge a lot. ',
    'referenceDate': '2016-02-01',
    'analyses': ['entities'],
    'returnMentions': True
})

print(responseObj)

Compared with the previous response, this one also contains the mentions of the individual entities: their text and references to the relevant tokens. The text, split into paragraphs, sentences and tokens, is added to the response automatically.

{
    "id": "1",
    "language": {"detected": "en"},
    "paragraphs": [{
        "id": "P2",
        "type": "BODY",
        "text": "The trip to London last summer was great. I also liked Cambridge a lot. ",
        "corrText": "The trip to London last summer was great. I also liked Cambridge a lot. ",
        "sentences": [{
            "id": "s0",
            "tokens": [
                {"id": "t0", "off": 0, "text": "The", "corrOff": 0, "corrText": "The"},
                {"id": "t1", "off": 4, "text": "trip", "corrOff": 4, "corrText": "trip"},
                {"id": "t2", "off": 9, "text": "to", "corrOff": 9, "corrText": "to"},
                {"id": "t3", "off": 12, "text": "London", "corrOff": 12, "corrText": "London"},
                {"id": "t4", "off": 19, "text": "last", "corrOff": 19, "corrText": "last"},
                {"id": "t5", "off": 24, "text": "summer", "corrOff": 24, "corrText": "summer"},
                {"id": "t6", "off": 31, "text": "was", "corrOff": 31, "corrText": "was"},
                {"id": "t7", "off": 35, "text": "great", "corrOff": 35, "corrText": "great"},
                {"id": "t8", "off": 40, "text": ".", "corrOff": 40, "corrText": "."}
            ]
        }, {
            "id": "s1",
            "tokens": [
                {"id": "t9", "off": 42, "text": "I", "corrOff": 42, "corrText": "I"},
                {"id": "t10", "off": 44, "text": "also", "corrOff": 44, "corrText": "also"},
                {"id": "t11", "off": 49, "text": "liked", "corrOff": 49, "corrText": "liked"},
                {"id": "t12", "off": 55, "text": "Cambridge", "corrOff": 55, "corrText": "Cambridge"},
                {"id": "t13", "off": 65, "text": "a", "corrOff": 65, "corrText": "a"},
                {"id": "t14", "off": 67, "text": "lot", "corrOff": 67, "corrText": "lot"},
                {"id": "t15", "off": 70, "text": ".", "corrOff": 70, "corrText": "."}
            ]
        }]
    }],
    "entities": [
        {"id": "E0", "stdForm": "2015-SU", "type": "date", "mentions": [{"id": "m0", "mwl": "last summer", "text": "last summer", "tokenIds": ["t4", "t5"]}]},
        {"id": "E1", "stdForm": "London", "type": "location", "mentions": [{"id": "m1", "mwl": "London", "text": "London", "tokenIds": ["t3"]}]},
        {"id": "E2", "stdForm": "Cambridge", "type": "location", "mentions": [{"id": "m2", "mwl": "Cambridge", "text": "Cambridge", "tokenIds": ["t12"]}]}
    ],
    "usedChars": 100
}
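Because each mention refers to tokens by id, the position of a mention in the text can be recovered by looking up its tokens. The sketch below shows one way to do this. The helper names index_tokens and mention_spans are hypothetical, and the code assumes that off is a zero-based character offset into the submitted text, which is what the sample response suggests. The response used here is a trimmed copy of the sample above that keeps only the fields and tokens needed.

```python
def index_tokens(response):
    """Map token id -> token dict across all paragraphs and sentences."""
    return {
        token["id"]: token
        for paragraph in response.get("paragraphs", [])
        for sentence in paragraph.get("sentences", [])
        for token in sentence.get("tokens", [])
    }


def mention_spans(response):
    """Return (stdForm, mention text, start, end) for every mention."""
    tokens = index_tokens(response)
    spans = []
    for entity in response.get("entities", []):
        for mention in entity.get("mentions", []):
            toks = [tokens[token_id] for token_id in mention["tokenIds"]]
            start = min(t["off"] for t in toks)
            end = max(t["off"] + len(t["text"]) for t in toks)
            spans.append((entity["stdForm"], mention["text"], start, end))
    return spans


text = "The trip to London last summer was great. I also liked Cambridge a lot. "
response = {  # trimmed copy of the sample response
    "paragraphs": [{"id": "P2", "text": text, "sentences": [
        {"id": "s0", "tokens": [
            {"id": "t3", "off": 12, "text": "London"},
            {"id": "t4", "off": 19, "text": "last"},
            {"id": "t5", "off": 24, "text": "summer"},
        ]},
        {"id": "s1", "tokens": [
            {"id": "t12", "off": 55, "text": "Cambridge"},
        ]},
    ]}],
    "entities": [
        {"id": "E0", "stdForm": "2015-SU", "type": "date", "mentions": [
            {"id": "m0", "text": "last summer", "tokenIds": ["t4", "t5"]}]},
        {"id": "E1", "stdForm": "London", "type": "location", "mentions": [
            {"id": "m1", "text": "London", "tokenIds": ["t3"]}]},
        {"id": "E2", "stdForm": "Cambridge", "type": "location", "mentions": [
            {"id": "m2", "text": "Cambridge", "tokenIds": ["t12"]}]},
    ],
}

for std_form, mention_text, start, end in mention_spans(response):
    print(std_form, repr(text[start:end]), start, end)
# 2015-SU 'last summer' 19 30
# London 'London' 12 18
# Cambridge 'Cambridge' 55 64
```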