Entities

Entities are names (person, location, organization, product, event) and general concepts (also called keywords; e.g., flu season, income tax) mentioned in the article.

Entity Information

The information about each detected entity includes (JSON field names are in parentheses):

  • GKB ID (gkbId): The ID of the entity in our Knowledge Base (GKB), typically in its generic bucket (e.g., G145 for Britain). This ID is present for all entities except so-called synthetic entities (see below).
  • Standard form (stdForm): A standard name of the entity in a particular language. For example, the standard form of G145 is United Kingdom in English, Spojené království in Czech, Royaume-Uni in French, and Storbritannien in Danish. The standard forms do not need to be direct translations of each other, although they typically are.
  • Type (type): One of the entity types listed below.
  • Mentions (mentions): If configured, we provide each mention of the entity in the text. For example, G145 might be mentioned as United Kingdom, UK, Britain, Great Britain and Northern Ireland, etc., in an English text. For each mention, we provide:
    • References to the article text (tokenIds) – useful for providing hyperlinks to tag pages, for example.
    • The normalized form of the mention (mwl). This abstracts away the morphological peculiarities of the particular language. For example, in Czech, this field will contain Spojené království (United Kingdom in English), even if the actual mention might be Spojeného království, Spojenému království, etc., depending on the grammatical function of the entity in the sentence.
  • Relevance score (feats.relevance): A number between 0 and 100 that indicates the importance of the entity in relation to the text (e.g., if the article is about Britain, the relevance of the G145 entity will be high; if Britain is only briefly mentioned, it will be low).
  • Other features (based on configuration; under the feats key):
    • Wikidata ID – when available
    • Social media links and other GKB properties

See the Entity object reference page for more information.
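
To illustrate how these fields fit together, here is a minimal Python sketch. The sample payload is hypothetical: only the field names (gkbId, stdForm, type, mentions, tokenIds, mwl, feats.relevance) come from the list above. The token ID format, the values, and the exact nesting are assumptions, so the Entity object reference remains the authority on the real shape.

```python
import json

# Hypothetical payload: field names follow the list above,
# but the values and the token ID format are made up for illustration.
sample_entity = json.loads("""
{
  "gkbId": "G145",
  "stdForm": "United Kingdom",
  "type": "location",
  "mentions": [
    {"tokenIds": ["w3", "w4"], "mwl": "United Kingdom"},
    {"tokenIds": ["w17"], "mwl": "UK"}
  ],
  "feats": {"relevance": "87.5"}
}
""")


def relevance(entity):
    """Return the relevance score as a float, or 0.0 when it is missing.

    float() accepts both numbers and numeric strings, so this works
    whichever way the score happens to be serialized.
    """
    return float(entity.get("feats", {}).get("relevance", 0.0))


def mention_forms(entity):
    """Return the distinct normalized mention forms (mwl), sorted."""
    return sorted({m["mwl"] for m in entity.get("mentions", [])})


print(sample_entity["stdForm"], relevance(sample_entity), mention_forms(sample_entity))
```

Reading the score through a helper keeps a missing feats key or an absent mentions list (mentions are only returned if configured) from raising an error.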

Entity Types

The standard configuration supports the following entity types:

  • person - John Doe
  • organization - UNESCO, IBM
  • location - London, France
  • product - Skoda Octavia, iPhone 13
  • event - Brexit, World War II
  • general - electric vehicle, trade war, flu season, income tax

Out of the box, we also support detection of dates, currencies, and other numeric expressions, which can be enabled on request. Geneea can also add support for detection of custom types (e.g., colors, food items, economic terms, laws, product numbers, etc.).
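
Because additional or custom types may be enabled, client code may want to group entities by type without assuming that the list is closed. The following is a rough sketch under that assumption. The entity dictionaries are hypothetical, and the "color" type merely stands in for a custom type of the kind mentioned above.

```python
from collections import defaultdict

# The six types of the standard configuration, as listed above.
STANDARD_TYPES = {"person", "organization", "location", "product", "event", "general"}


def group_by_type(entities):
    """Map each entity type to the standard forms of its entities."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["type"]].append(entity["stdForm"])
    return dict(groups)


def non_standard_types(entities):
    """Return the types that are not part of the standard configuration."""
    return sorted({entity["type"] for entity in entities} - STANDARD_TYPES)


# Hypothetical entities; "color" stands in for a custom type.
entities = [
    {"type": "organization", "stdForm": "UNESCO"},
    {"type": "organization", "stdForm": "IBM"},
    {"type": "event", "stdForm": "Brexit"},
    {"type": "color", "stdForm": "red"},
]

print(group_by_type(entities))
print(non_standard_types(entities))
```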

Derived Entities

Derived entities are not directly mentioned in the text but are related to something that is. For example, if a text mentions Prague, we also infer that it is about the Czech Republic. In this case, Prague is a direct entity, while Czech Republic is a derived entity. We can infer this because we store information about cities and their countries in GKB.

Derived entities inherit all mentions of the original entity. So in the example above, Prague would also count as a (derived) mention of the Czech Republic.

An entity can have both a direct and a derived mention. For example, in the sentence “Prague is the capital of the Czech Republic,” the entity Czech Republic has two mentions: a direct one (Czech Republic) and a derived one (Prague).

We support a selection of derived entity types (e.g., "manufacturer", "industry", "country", "region", "district", "city", "cityPart"). We add coverage as needed; therefore, not all markets are covered equally.
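
The way derived entities inherit mentions can be sketched in a few lines of Python. This is a toy model, not the actual implementation: TOY_KB is a hypothetical stand-in for GKB containing a single city-to-country relation, and the function name and data layout are assumptions made for illustration.

```python
# Toy stand-in for GKB: entity -> list of (relation, related entity).
TOY_KB = {
    "Prague": [("country", "Czech Republic")],
}


def with_derived(direct_mentions):
    """Add derived entities to a mapping of entity -> mention strings.

    Returns a mapping of entity -> list of (mention, kind) pairs,
    where kind is "direct" or "derived".
    """
    result = {
        name: [(mention, "direct") for mention in mentions]
        for name, mentions in direct_mentions.items()
    }
    for name, mentions in direct_mentions.items():
        for _relation, target in TOY_KB.get(name, []):
            # The derived entity inherits every mention of the original one.
            result.setdefault(target, []).extend(
                (mention, "derived") for mention in mentions
            )
    return result


# "Prague is the capital of the Czech Republic."
found = with_derived({"Prague": ["Prague"], "Czech Republic": ["Czech Republic"]})
print(found["Czech Republic"])
```

For the example sentence, Czech Republic ends up with one direct mention (Czech Republic) and one derived mention (Prague), while Prague keeps only its direct mention.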

Synthetic Entities

We strive to have all entities in GKB, but this is not always possible. Entities that are reliably detected in the text but do not have a corresponding GKB entry are called synthetic entities. We do not return their Wikidata ID, social media links, etc. Their standard form is based on various grammatical heuristics. Generally, synthetic entities are of lower quality than GKB-linked entities. Therefore, for most use cases, they are not recommended.
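
Since the GKB ID is present for all entities except synthetic ones, a consumer that wants only GKB-linked entities could filter on that field. The sketch below assumes that a synthetic entity either lacks the gkbId key or carries an empty value. That serialization detail is an assumption, and the sample entities are hypothetical.

```python
def is_synthetic(entity):
    """Treat an entity as synthetic when it has no usable gkbId.

    Assumption: synthetic entities omit gkbId or leave it empty/null.
    """
    return not entity.get("gkbId")


def linked_only(entities):
    """Keep only entities that are linked to GKB."""
    return [entity for entity in entities if not is_synthetic(entity)]


# Hypothetical entities: the second one has no GKB entry.
entities = [
    {"gkbId": "G145", "stdForm": "United Kingdom", "type": "location"},
    {"stdForm": "Acme Widgets", "type": "organization"},
]

print([entity["stdForm"] for entity in linked_only(entities)])
```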