Entities
Entities are names (person, location, organization, product, event) and general concepts (also called keywords; e.g., flu season, income tax) mentioned in the article.
Entity Information
The information about each detected entity includes (JSON field names are in parentheses):
- GKB ID (gkbId): The ID of the entity in our Knowledge Base (GKB), typically in its generic bucket (e.g., G145 for Britain). This ID is present for all entities except so-called synthetic entities (see below).
- Standard form (stdForm): A standard name of the entity in a particular language. For example, the standard form of G145 is United Kingdom in English, Spojené království in Czech, Royaume-Uni in French, and Storbritannien in Danish. The standard forms do not need to be direct translations of each other, although they typically are.
- Type (type): One of the entity types listed below.
- Mentions (mentions): If configured, we provide each mention of the entity in the text. For example, G145 might be mentioned as United Kingdom, UK, Britain, Great Britain and Northern Ireland, etc., in an English text. For each mention, we provide:
  - References to the article text (tokenIds), useful for providing hyperlinks to tag pages, for example.
  - The normalized form of the mention (mwl). This abstracts away the morphological peculiarities of the particular language. For example, in Czech, this field will contain Spojené království (United Kingdom in English), even if the actual mention might be Spojeného království, Spojenému království, etc., depending on the grammatical function of the entity in the sentence.
- Relevance score (feats.relevance): A number between 0 and 100 that indicates the importance of the entity in relation to the text (e.g., if the article is about Britain, the relevance of the G145 entity will be high; if Britain is only briefly mentioned, it will be low).
- Other features (based on configuration; under the feats key):
  - Wikidata ID, when available
  - Social media links and other GKB properties
See the Entity object reference page for more information.
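As an illustration of how these fields fit together, the following minimal Python sketch reads a hand-written entity list. It is a sketch under stated assumptions, not actual API output: the nesting, the relevance values, the token ID format, and the second entity's GKB ID are all invented for the example, and the helper names relevance, top_entities, and normalized_mentions are hypothetical. The Entity object reference remains the authoritative description of the schema.

```python
import json

# Hand-written sample, NOT real API output. Field names follow the list
# above; the nesting, the values, the token ID format, and the ID "G0000"
# are assumptions made up for illustration.
SAMPLE = """
[
  {
    "gkbId": "G145",
    "stdForm": "United Kingdom",
    "type": "location",
    "mentions": [
      {"tokenIds": ["t3"], "mwl": "UK"},
      {"tokenIds": ["t17", "t18"], "mwl": "United Kingdom"}
    ],
    "feats": {"relevance": 87.5}
  },
  {
    "gkbId": "G0000",
    "stdForm": "flu season",
    "type": "general",
    "mentions": [{"tokenIds": ["t42", "t43"], "mwl": "flu season"}],
    "feats": {"relevance": 12.0}
  }
]
"""


def relevance(entity):
    """Relevance score (0-100); treated as 0 when the feature is missing."""
    return float(entity.get("feats", {}).get("relevance", 0.0))


def top_entities(entities, min_relevance=50.0):
    """Entities at or above the threshold, most relevant first."""
    kept = [e for e in entities if relevance(e) >= min_relevance]
    return sorted(kept, key=relevance, reverse=True)


def normalized_mentions(entity):
    """Distinct normalized mention forms (mwl), in order of first appearance."""
    seen = []
    # Mentions are only returned if configured, so the key may be absent.
    for mention in entity.get("mentions", []):
        form = mention.get("mwl")
        if form and form not in seen:
            seen.append(form)
    return seen


entities = json.loads(SAMPLE)
for e in top_entities(entities):
    print(e["gkbId"], e["stdForm"], e["type"], relevance(e), normalized_mentions(e))
```

The threshold of 50 is arbitrary; a suitable cut-off depends on the use case (for example, tagging only the main topics of an article versus listing everything mentioned).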
Entity Types
The standard configuration supports the following entity types:
- person: John Doe
- organization: UNESCO, IBM
- location: London, France
- product: Skoda Octavia, iPhone 13
- event: Brexit, World War II
- general: electric vehicle, trade war, flu season, income tax
Out of the box, we also support detection of dates, currencies, and other numeric expressions, which can be enabled on request. Geneea can also add support for detection of custom types (e.g., colors, food items, economic terms, laws, product numbers, etc.).
Derived Entities
Derived entities are not directly mentioned in the text but are related to something that is.
For example, if a text mentions Prague, we also infer that it is about the Czech Republic.
In this case, Prague is a direct entity, while Czech Republic is a derived entity.
We can infer this because we store information about cities and their countries in GKB.
Derived entities inherit all mentions of the original entity.
So in the example above, Prague would also count as a (derived) mention of the Czech Republic.
An entity can have both a direct and a derived mention. For example, in the sentence “Prague is the capital of the Czech Republic,” the entity Czech Republic has two mentions: a direct one (Czech Republic) and a derived one (Prague).
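The inheritance of mentions described above can be pictured with a toy Python sketch. This is only a conceptual illustration, not how the service is implemented: the PARENT_OF table is a hypothetical stand-in for the city-to-country facts stored in GKB, and collect_mentions and the "direct"/"derived" labels are names invented for this example.

```python
from collections import defaultdict

# Hypothetical stand-in for the city-to-country knowledge stored in GKB.
PARENT_OF = {"Prague": "Czech Republic"}


def collect_mentions(direct_mentions):
    """Map each entity to a list of (mention text, kind) pairs.

    direct_mentions is a list of (entity, mention text) pairs detected
    directly in the text, in reading order.
    """
    result = defaultdict(list)
    for entity, text in direct_mentions:
        result[entity].append((text, "direct"))
        parent = PARENT_OF.get(entity)
        if parent is not None:
            # The derived entity inherits the mention of the original entity.
            result[parent].append((text, "derived"))
    return dict(result)


# "Prague is the capital of the Czech Republic."
found = [("Prague", "Prague"), ("Czech Republic", "Czech Republic")]
mentions = collect_mentions(found)
print(mentions["Prague"])
print(mentions["Czech Republic"])
```

For the example sentence, Czech Republic ends up with two mentions, one derived (from Prague) and one direct, while Prague has a single direct mention.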
We support a selection of derived entity types (e.g., "manufacturer", "industry", "country", "region", "district", "city", "cityPart"). We add coverage as needed; therefore, not all markets are covered equally.
Synthetic Entities
We strive to have all entities in GKB, but this is not always possible. Entities that are reliably detected in the text but do not have a corresponding GKB entry are called synthetic entities. We do not return their Wikidata ID, social media links, etc. Their standard form is based on various grammatical heuristics. Generally, synthetic entities are of lower quality than GKB-linked entities. Therefore, for most use cases, they are not recommended.
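Because the GKB ID is present for all entities except synthetic ones, a client can drop synthetic entities by checking that field. The Python sketch below assumes that a synthetic entity either lacks the gkbId key or carries an empty or null value; the exact representation is not specified here. The helper names and the sample entity "Acme Widgets" are invented for illustration.

```python
def is_synthetic(entity):
    """Treat an entity without a (non-empty) GKB ID as synthetic.

    Assumption: synthetic entities have no gkbId key, or an empty/null one.
    """
    return not entity.get("gkbId")


def drop_synthetic(entities):
    """Keep only entities that are linked to a GKB entry."""
    return [e for e in entities if not is_synthetic(e)]


# Hand-written sample; "Acme Widgets" is an invented synthetic entity.
sample = [
    {"gkbId": "G145", "stdForm": "United Kingdom", "type": "location"},
    {"stdForm": "Acme Widgets", "type": "organization"},
]
print([e["stdForm"] for e in drop_synthetic(sample)])
```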