Entities

Summary definition

Entities are uniquely identifiable objects and concepts mentioned in text, such as people, locations, organizations, products, or events. They can be real or fictional. In the sentence "Harry Potter traveled to London", the entities are Harry Potter and London.

Detailed definition

In natural language processing, an entity represents something that can be clearly, uniquely identified and named, whether it is real, fictional, historical, or conceptual. Typical examples include people (e.g., Angela Merkel, Sherlock Holmes), organizations (Airbus, United Nations), locations (Prague), products (iPhone), or events (The World Economic Forum).

The process of their detection in text is called Named Entity Recognition (NER).

Entities are an important way of turning unstructured text into structured data. They are the building blocks of structured information extracted from text. Once detected, they can be tracked, linked to databases, grouped, analyzed, and connected across large content collections.
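As a rough illustration of how detection turns text into structured records, here is a minimal Python sketch. It assumes a tiny hand-made name list (`GAZETTEER`) in place of a real NER model, which would normally be a trained statistical system; the `Mention` record and the `find_mentions` function are hypothetical names used only for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str   # surface form as it appears in the text
    type: str   # entity type, e.g. "person"
    start: int  # character offset where the mention starts
    end: int    # character offset just past the mention


# Hypothetical toy name list; a real system would use a trained NER model.
GAZETTEER = {
    "Harry Potter": "person",
    "London": "location",
    "NASA": "organization",
}


def find_mentions(text: str) -> list:
    """Return every gazetteer name found in the text as a structured record."""
    mentions = []
    for name, etype in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            mentions.append(Mention(match.group(), etype, match.start(), match.end()))
    # Sort by position so the records follow the reading order of the text.
    return sorted(mentions, key=lambda m: m.start)


mentions = find_mentions("Harry Potter traveled to London.")
for m in mentions:
    print(m)
```

Once mentions are stored as records with a type and character offsets, they can be counted, grouped, and joined with other data like any structured dataset.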

In news media, entities help answer questions such as:

  • Who is being written about?
  • Which organizations or places are mentioned most often?
  • How does coverage of a specific person or company evolve over time?
  • Which articles should be recommended to a user interested in "NASA"?

As a type of semantic tag, entities can also serve many of the same purposes as tags; see the separate article on tags.

To make entities usable at scale, modern systems usually apply additional processes such as normalization, linking, and relevance scoring described briefly below.

Entity Types

Entities are commonly classified into categories to help machines understand what “kind” of thing they are dealing with. Common standard types include:

  • Person: Real or fictional people.
  • Organization: Companies, governments, NGOs, sports teams.
  • Location: Cities, countries, mountains, rivers, public spaces, planets.
  • Product: Books, movies, vehicles, electronics, drugs.
  • Event: Wars, hurricanes, sports tournaments, festivals.

Some systems define dozens of types or use a hierarchical structure. It is also common to treat numeric expressions (dates, money, percentages) as entity types, although they function differently from named objects. Domains like law or medicine often require specialized types (e.g., laws, diseases).

Assigning types to entities is not always straightforward. Classification often depends on context or even personal preference:

  • Is the National Theatre an organization or a location? Compare "The National Theatre hired a new director." with "Let's meet in front of the National Theatre."
  • Is the EU an organization or a country (location)? Compare "Last month, the EU voted on the law." with "Last month, she traveled throughout the EU."
  • Is R2D2 a product or a fictional person?

Entity normalization

Entity normalization means that all different names, variants or spellings referring to the same real-world entity are assigned a single designated standard form (canonical name).

For example:

  • United Kingdom, U.K., and Britain all refer to the same entity.
  • Charles III, Prince Charles, and Charles Philip Arthur George Mountbatten-Windsor also refer to the same person.

Without normalization, analytics and search would fragment across multiple names. Normalization allows systems to store and display a single standard form for each entity, regardless of how it appears in the text.
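The fragmentation problem can be shown with a small Python sketch. It assumes a hand-written alias table (`ALIASES`) and a hypothetical `normalize` helper; production systems typically derive aliases from a knowledge base rather than a hard-coded dictionary.

```python
from collections import Counter

# Hypothetical alias table: lowercased variant -> standard form (canonical name).
ALIASES = {
    "united kingdom": "United Kingdom",
    "u.k.": "United Kingdom",
    "britain": "United Kingdom",
    "charles iii": "Charles III",
    "prince charles": "Charles III",
}


def normalize(mention: str) -> str:
    """Map a mention to its standard form; unknown names pass through unchanged."""
    key = " ".join(mention.split()).casefold()  # collapse whitespace, ignore case
    return ALIASES.get(key, mention)


mentions = ["U.K.", "Britain", "United Kingdom", "Prince Charles", "Charles III", "Prague"]

raw_counts = Counter(mentions)                              # fragmented across variants
normalized_counts = Counter(normalize(m) for m in mentions)  # one entry per entity

print(raw_counts)
print(normalized_counts)
```

Counting the raw mentions yields six separate entries with one occurrence each, while counting the normalized forms groups them into three entities.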

Entity linking

Entity linking connects a detected entity mention in text to a unique identifier (ID) representing that entity in a knowledge base (such as Wikidata or a publisher's internal database).

This step goes beyond entity normalization by addressing two problems:

  • Ambiguity (one name, different things): It distinguishes between Georgia (the U.S. state) and Georgia (the European country).
  • Cross-language consistency: It connects the entity to a stable ID that works even if the name changes completely across languages (e.g., Wikidata’s Q145 for United Kingdom in English, Royaume-Uni in French, and Spojené království in Czech).

By assigning a stable ID, entities can be reliably referenced across articles, languages, and systems (such as a CMS or DAM), and linked to internal or external knowledge bases to retrieve additional information.

No knowledge base is complete, so an entity linking system must also decide how to handle entities that have no knowledge base ID, either because they are not well known enough to be included or because they have not been added yet.
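A minimal Python sketch of linking with disambiguation follows. Everything in it is an assumption made for illustration: the two-entry knowledge base `KB`, its made-up IDs, the context keywords, and the `link` function. Real linkers use far richer evidence, but the shape of the decision is similar: pick the candidate best supported by the context, and return no ID when there is no candidate or no evidence.

```python
import re
from typing import Optional

# Hypothetical mini knowledge base. The IDs are invented for this example
# and are not real Wikidata identifiers.
KB = {
    "kb:georgia-state": {
        "name": "Georgia",
        "context": {"atlanta", "governor", "state", "savannah"},
    },
    "kb:georgia-country": {
        "name": "Georgia",
        "context": {"tbilisi", "caucasus", "country", "parliament"},
    },
}


def link(mention: str, sentence: str) -> Optional[str]:
    """Return the ID of the best-supported candidate, or None if unlinkable."""
    words = set(re.findall(r"\w+", sentence.casefold()))
    # Score each candidate by how many of its context keywords appear nearby.
    scored = sorted(
        (
            (len(entry["context"] & words), kb_id)
            for kb_id, entry in KB.items()
            if entry["name"].casefold() == mention.casefold()
        ),
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None  # not in the knowledge base, or no supporting evidence
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # tie between candidates: leave the mention unlinked
    return scored[0][1]


print(link("Georgia", "The governor of Georgia spoke in Atlanta."))
print(link("Georgia", "Protesters gathered in Tbilisi, the capital of Georgia."))
print(link("Narnia", "They traveled to Narnia."))
```

Returning `None` for unknown or undecidable mentions is only one possible policy; a system could instead create a provisional local entity for them.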

Entity relevance

Entity relevance (or salience) expresses how important an entity is within a specific article, not just whether it appears.

For example:

  • An entity mentioned in the headline and repeatedly throughout the article is highly relevant.
  • An entity mentioned once in passing may have low relevance.

Frequently mentioned entities are usually more relevant, but not always. Where and how an entity is mentioned also matters. For example, in a news report that quotes a tweet, the platform on which it was posted is usually not very relevant to the story itself.
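The interplay of frequency and position can be sketched with a simple heuristic in Python. The weights, the cap of five mentions, and the `relevance` function are arbitrary assumptions chosen for illustration, and the name matching is a naive substring search; this is not how any particular product computes relevance.

```python
def relevance(entity: str, headline: str, body: str) -> float:
    """Toy relevance score in [0, 1] from position and frequency signals."""
    name = entity.casefold()
    paragraphs = [p for p in body.split("\n\n") if p.strip()]

    in_headline = name in headline.casefold()
    in_lead = bool(paragraphs) and name in paragraphs[0].casefold()
    mentions = body.casefold().count(name)  # naive substring count

    # Arbitrary illustrative weights: the headline counts most, then the
    # lead paragraph, then frequency (capped so repetition cannot dominate).
    score = 0.5 * in_headline + 0.2 * in_lead + 0.3 * min(mentions, 5) / 5
    return round(score, 2)


headline = "NASA delays Moon mission"
body = (
    "NASA announced a delay on Monday.\n\n"
    "The agency said NASA engineers need more time.\n\n"
    "The statement was first posted on Twitter."
)

print(relevance("NASA", headline, body))
print(relevance("Twitter", headline, body))
```

In this example NASA scores far higher than Twitter, because it appears in the headline and the lead paragraph, while the platform is mentioned only once in passing.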

Relevance scoring helps determine:

  • which entities should become visible tags,
  • which entities should be highlighted to readers,
  • and which entities matter most for content analytics and recommendations.

Relevance vs. Confidence: It is important not to confuse relevance (how important the entity is to the article) with confidence (how sure the AI is that it detected the entity correctly). An AI can be 99% confident that it found the word Facebook, but that entity might have only 1% relevance if it was mentioned in a footer link.
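To make the distinction concrete, the following Python sketch keeps the two scores as separate fields and filters on each independently. The `DetectedEntity` record, the `visible_tags` function, the threshold values, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DetectedEntity:
    name: str
    confidence: float  # how sure the system is that the detection is correct
    relevance: float   # how important the entity is to this article


def visible_tags(entities, min_confidence=0.8, min_relevance=0.3):
    """Keep only entities that are both reliably detected and important."""
    return [
        e.name
        for e in entities
        if e.confidence >= min_confidence and e.relevance >= min_relevance
    ]


entities = [
    DetectedEntity("NASA", confidence=0.97, relevance=0.82),
    DetectedEntity("Facebook", confidence=0.99, relevance=0.01),  # footer link
    DetectedEntity("Mercury", confidence=0.40, relevance=0.60),   # uncertain detection
]

print(visible_tags(entities))
```

Facebook is dropped despite near-certain detection because it barely matters to the article, and Mercury is dropped despite a high relevance score because the detection itself is doubtful.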

Geneea context

Geneea automatically detects and disambiguates entities in news articles and applies normalization, linking, and relevance scoring as part of its semantic analysis.

  • Hybrid Knowledge Base: Geneea links entities to a knowledge base that combines public knowledge bases (like Wikidata) with the publisher's custom entities. The knowledge base provides stable IDs and standard form. The publisher can change the standard form on their side (e.g. for editorial or legal reasons) without breaking the system, as the entities will remain linked via the ID.
  • Relevance: Geneea assigns a relevance score to every detected entity. While not every entity is automatically promoted to a visible tag, this score allows publishers to set thresholds – automatically filtering out "noise" so only significant entities are used for navigation or analytics.

This allows publishers to work with entity data that is accurate, consistent, and meaningful, supporting use cases such as content enrichment, analytics, search, recommendations, and brand safety. Publishers can track which people, companies, or locations are being written about and build rich profiles and analytics on top of editorial content.