
Entities

Entities are elements explicitly mentioned in the article. They include:

  • Names of people, locations, organizations, products, and events (e.g., Barack Obama, Paris, Apple Inc., iPhone 13, Olympic Games)
  • General concepts, sometimes referred to as keywords, which may not refer to a specific named item but represent meaningful ideas or topics—such as flu season, electric vehicle, or income tax.

These entities are identified to help you understand what the article is about and to support tasks like tagging, linking, or analysis.

Entity Information

The following information is provided for each detected entity (JSON field names are in parentheses):

  • GKB ID (gkbId): The entity's identifier in our Geneea Knowledge Base (GKB), typically in its generic bucket (e.g., G145 for Britain). This field is missing for synthetic entities (see below).

  • Standard Form (stdForm): The standard (canonical) name of the entity in the relevant language. For example, the standard form for G145 is:

    • United Kingdom in English
    • Spojené království in Czech
    • Royaume-Uni in French
    • Storbritannien in Danish

    Standard forms are typically translations of one another, but they do not need to be direct equivalents.

  • Type (type): One of the supported entity types listed below.

  • Mentions (mentions): If enabled, we return all mentions of the entity within the text. For example, G145 might appear as United Kingdom, UK, Britain, Great Britain and Northern Ireland in English. For each mention of an entity, we provide the following information:

    • References to the article text (tokenIds): These are references to specific tokens in the article where the entity is mentioned. This allows you to link back to the exact positions in the text—useful, for example, if you want to highlight or hyperlink entity mentions to tag or detail pages.

    • Normalized form of the mention (mwl): This abstracts away the language-specific morphological variations. For instance, in Czech, the entity Spojené království (United Kingdom in English) may appear in several different grammatical cases—e.g., Spojeného království, Spojenému království, etc.—depending on the context and grammatical function of the entity in the sentence.

      The mwl field always contains the base form: Spojené království. This normalization helps with consistent tagging, grouping, and analysis across inflected forms.

  • Relevance Score (feats.relevance): A numeric value between 0 and 100 that indicates how central the entity is to the meaning of the text. For example, if the article is primarily about Britain, the relevance score of the G145 entity (United Kingdom) will be high. If Britain is mentioned only briefly or in passing, the score will be much lower. This helps to distinguish main topics from incidental references.

  • Other Features (depending on configuration, under the feats key):

    • Wikidata ID – when available, providing a link to structured knowledge in Wikidata
    • Social media handles, Wikipedia links – e.g., for public figures or organizations
    • Other metadata from our internal GKB

    These properties are especially useful for enriching the entity with external context or linking it to structured datasets.

See the Entity object reference page for more details.
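
As a minimal sketch of how these fields might be read, the Python snippet below works on a hand-written entity list. The field names (gkbId, stdForm, type, mentions, tokenIds, mwl, feats.relevance) come from the description above. The exact nesting, the token ID format, the sample values, the threshold of 50, and the helper names main_entities and mention_token_ids are assumptions made for illustration, so check the Entity object reference for the actual schema.

```python
# Hypothetical sample data: nesting, token ID format, and values are assumed,
# not taken from a real API response.
entities = [
    {
        "gkbId": "G145",
        "stdForm": "United Kingdom",
        "type": "location",
        "mentions": [
            {"tokenIds": ["t0"], "mwl": "Britain"},
            {"tokenIds": ["t7", "t8"], "mwl": "United Kingdom"},
        ],
        "feats": {"relevance": 82.5},
    },
    {
        # No gkbId: this would be a synthetic entity.
        "stdForm": "John Doe",
        "type": "person",
        "mentions": [{"tokenIds": ["t12", "t13"], "mwl": "John Doe"}],
        "feats": {"relevance": 12.0},
    },
]


def relevance(entity):
    """Read feats.relevance as a float, defaulting to 0 when it is absent."""
    return float(entity.get("feats", {}).get("relevance", 0))


def main_entities(entities, threshold=50.0):
    """Standard forms of entities at or above the threshold, most relevant first."""
    kept = [e for e in entities if relevance(e) >= threshold]
    kept.sort(key=relevance, reverse=True)
    return [e["stdForm"] for e in kept]


def mention_token_ids(entity):
    """Flatten the tokenIds of all mentions, e.g. to highlight them in the text."""
    return [tid for m in entity.get("mentions", []) for tid in m["tokenIds"]]


print(main_entities(entities))
print(mention_token_ids(entities[0]))
```

The relevance cut-off is a design choice for the consumer: a high threshold keeps only the main topics, while a low one also lets incidental references through.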

Entity Types

The standard configuration includes the following entity types:

  • person: John Doe
  • organization: UNESCO, IBM
  • location: London, France
  • product: Skoda Octavia, iPhone 13
  • event: Brexit, World War II
  • general: electric vehicle, trade war, flu season, income tax

In addition, we support detection of numeric and temporal expressions such as dates, currencies, and amounts.

We can also enable detection of custom entity types tailored to your use case, for example colors, food items, economic terms, laws, or product numbers.

Derived Entities

Derived entities are not explicitly mentioned in the text but are logically related to those that are.

For example, if an article mentions Prague, we infer that it is also about the Czech Republic, based on geographic relationships stored in our Knowledge Base.
Here:

  • Prague is a direct entity
  • Czech Republic is a derived entity

Derived entities inherit all mentions of the original (direct) entity. So, in the example above, every mention of Prague would also count as a (derived) mention of the Czech Republic.

It's also possible for an entity to have both a direct and derived mention in the same sentence.
For example, in the sentence “Prague is the capital of the Czech Republic,” the entity Czech Republic has two mentions:

  • A direct mention: Czech Republic
  • A derived mention: Prague
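
The inheritance rule can be illustrated with a short Python sketch. It is not how the service computes derived entities. The PARENT lookup is a hypothetical stand-in for the geographic relationships stored in the Knowledge Base, and collect_mentions is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical stand-in for knowledge-base relationships (assumed, not real data).
PARENT = {"Prague": "Czech Republic"}


def collect_mentions(direct_mentions):
    """Map each entity name to a list of (kind, mention text) pairs.

    direct_mentions lists entity names in the order they appear in the text.
    Every direct mention is also recorded as a derived mention of the parent.
    """
    result = defaultdict(list)
    for name in direct_mentions:
        result[name].append(("direct", name))
        parent = PARENT.get(name)
        if parent is not None:
            result[parent].append(("derived", name))
    return dict(result)


# "Prague is the capital of the Czech Republic."
mentions = collect_mentions(["Prague", "Czech Republic"])
print(mentions["Czech Republic"])
```

For the example sentence, Czech Republic ends up with one derived mention (Prague) and one direct mention, while Prague has only its direct mention.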

We currently support a selection of derived entity types such as:

  • manufacturer
  • industry
  • country, region, district, city, cityPart

Coverage varies depending on geography and data availability.

Synthetic Entities

While we aim to include all entities in the GKB, this is not always possible. Entities that are reliably detected in the text but lack a corresponding GKB entry are called synthetic entities.

Characteristics of synthetic entities:

  • They do not have GKB identifiers, Wikidata IDs, or social media links.
  • Their standard form is generated from the mention using grammatical heuristics.
  • Quality may be lower due to lack of structured data.

In general, GKB-linked entities are preferred. Synthetic entities are useful for completeness but are not recommended for most production use cases, especially where accuracy or linkage is important.
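
One way to act on this recommendation is to separate the two groups on the consumer side. The sketch below assumes that a synthetic entity can be recognized simply by its missing gkbId field, as described above. The sample data and the helper name split_by_gkb_link are hypothetical.

```python
# Hypothetical sample entities; only the presence or absence of gkbId matters here.
entities = [
    {"gkbId": "G145", "stdForm": "United Kingdom", "type": "location"},
    {"stdForm": "John Doe", "type": "person"},  # no gkbId: treated as synthetic
]


def split_by_gkb_link(entities):
    """Return (gkb_linked, synthetic) lists based on whether gkbId is present."""
    linked, synthetic = [], []
    for entity in entities:
        (linked if entity.get("gkbId") else synthetic).append(entity)
    return linked, synthetic


linked, synthetic = split_by_gkb_link(entities)
print([e["stdForm"] for e in linked])
print([e["stdForm"] for e in synthetic])
```

A production pipeline might use only the linked list for tagging and keep the synthetic list for manual review.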