Semantic Tags

Semantic tags are labels assigned to articles to describe their content. By assigning unique identifiers to entities and topics, our tagger builds a machine-readable base for content discovery and analytics.

See:

The Glossary for a general overview of semantic tags and their importance to media organizations
The Tag Object in our API reference
The Semantic Tagging Guide for a step-by-step guide on implementing semantic tagging

Components of Semantic Tags

The Geneea tagger analyzes content using specialized models. Depending on your configuration, the API returns tags from three main sources: entities, topic categories (taxonomies), and topics.

The table below provides a quick overview of these tag types, while the following subsections offer detailed explanations and links to further documentation.

Source	Description	Examples
Entities
Named Entities	Specific people, locations, organizations, products, and events	Paul McCartney; France; Airbus; Game of Thrones; Christmas; Hurricane Katrina
General Entities	Broad concepts and keywords	global warming; income tax; smartphone
Topic Categories (Taxonomies)
IPTC Media Topics	Hierarchical taxonomy developed by IPTC; mainly used for analytics	Science and Technology; Economy, Business and Finance; Economic Trends and Indicators
IAB Content Taxonomy	Hierarchical taxonomy designed by IAB for marketing purposes	Technology & Computing; Personal Finance; Pet Insurance
IAB Brand Safety Categories	Tags flagging sensitive content to protect brand reputation and ad placement	Adult & Explicit Sexual Content; Terrorism; Debated Sensitive Social Issues
Topics
Geneea Topics	Reader-friendly topic classification	science and technology; economy; culture
Editorial Tags	Custom tags unique to your organization	Important people of our city; Mysterious murder stories

Entities

Entities are specific elements and concepts identified within the article. They help you understand exactly who or what the content is about. For more details, see our Entities Guide. Our tagger detects two main types of entities:

Named Entities: Specific people, locations, organizations, products, and events (e.g., Barack Obama, Paris, Apple Inc., World War II).
General Entities: Broad concepts and keywords that represent meaningful ideas (e.g., electric vehicle, income tax, flu season).

Entities are typically explicitly mentioned in the text. For example, when an article is assigned an Albert Einstein tag, the name Einstein usually appears directly in the text. However, entities can also be assigned indirectly. This is especially common for:

Derived Entities: Entities that are mentioned indirectly through other entities. For instance, an article mentioning French cities or regions is assigned the derived entity France as well.
Geneea Topics: See the section on the fuzzy boundary between entities and topics below.

Topic Categories (Taxonomies)

Topic categories are controlled vocabularies, often organized into stable hierarchical structures. Unlike free-form tagging — where different editors might use varying words for the same concept — categories provide standardized topics with stable identifiers. This is essential for reliable, large-scale content classification and metadata exchange.

Geneea automatically assigns categories from the following industry-standard taxonomies:

IPTC Media Topics: Designed specifically for editorial and journalistic workflows. Maintained by the IPTC, this taxonomy contains over 1,200 topics across five levels. It reflects how newsrooms structure coverage (e.g., Politics and Government → Election → National Elections) and is ideal for editorial analytics, archive organization, and content discovery.
IAB Content Taxonomy: Designed for the digital advertising ecosystem. It provides a shared vocabulary for publishers, ad platforms, and advertisers to align content for targeting and monetization.
IAB Brand Safety Categories: Part of the IAB framework, these categories flag content for brand safety and suitability. The tagger assesses context and risk across sensitive categories (such as Adult Content or Violence) using semantic analysis rather than keyword blocking. Keyword-based blocking is difficult to maintain and often produces false positives that inadvertently block legitimate content; semantic analysis gives a more accurate, context-aware classification.

A category captures the overall meaning of a text; it doesn't require a specific word to be present. For example, an article might be categorized under Tennis without ever explicitly using the word.

We can also provide custom categories mapped to your organization's specific needs.

Topics

While taxonomies rely on a strictly controlled hierarchy, Topics are a more open and flexible set of labels.

Geneea Topics: These provide a broader, reader-friendly classification. While similar in purpose to IPTC or IAB categories, Geneea Topics focus on straightforward, easy-to-digest themes like science and technology, economy, or culture, making them ideal for frontend display and general reader navigation.
Editorial Tags: These are custom, user-defined tags specific to your organization's unique content strategy (e.g., Important people of our city or Mysterious murder stories). Editorial tags are recall-centric. The system aims to offer a wide variety of relevant suggestions. We consider it a success if a journalist accepts one out of every five suggested editorial tags.

Entities vs. Topics

The boundary between entities and topics is fluid. The API handles this overlap seamlessly because both entities and topics link to the same knowledge base, and they share the exact same Geneea Knowledge Base ID (e.g., G847 for tennis).

Depending on how the concept appears in the text, you will see one of three behaviors in the API response:

Entity and Topic: If the article is about tennis and explicitly mentions it, the API returns the tennis tag with a high relevance score, the feats.topic: geneea property, and the mentions included in the mentions array.
Topic Only: If the article is clearly about tennis (e.g., discussing Wimbledon, racquets, and Grand Slams) but never actually uses the word "tennis", the API returns the same tag. It will have the same GKB ID, high relevance score, and feats.topic: geneea property, but the mentions array will be empty.
Entity Only: If the article only mentions tennis in passing (e.g., an article about a city budget that funds a new tennis court among other things), it is treated strictly as an entity. The API returns the tag with a populated mentions array, but with a lower relevance score and without the feats.topic: geneea property, since the article itself is not about tennis.

This shared architecture ensures that your downstream systems can accurately group content by subject matter, regardless of whether the author explicitly named the concept, just described it, or merely mentioned it in passing.
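
The three behaviors above can be told apart in client code with a small helper. The following is a minimal sketch, not an official client: it assumes each tag arrives as a dictionary with a feats mapping and a mentions list, as described in this section, and the function name tag_role and the returned labels are made up for illustration.

```python
from typing import Any, Mapping


def tag_role(tag: Mapping[str, Any]) -> str:
    """Classify a tag by the three behaviors described above.

    Assumes the tag is a dict with an optional "feats" mapping and an
    optional "mentions" list; the returned labels are illustrative only.
    """
    feats = tag.get("feats") or {}
    is_topic = feats.get("topic") == "geneea"
    has_mentions = bool(tag.get("mentions"))

    if is_topic and has_mentions:
        return "entity_and_topic"
    if is_topic:
        return "topic_only"
    if has_mentions:
        return "entity_only"
    # Neither a Geneea topic nor explicitly mentioned, for example a
    # derived entity or a taxonomy category.
    return "other"


# Hypothetical tag: the article is about tennis but never uses the word.
topic_only_tag = {"gkbId": "G847", "feats": {"topic": "geneea"}, "mentions": []}
print(tag_role(topic_only_tag))
```

Because all three cases carry the same Geneea Knowledge Base ID, the role only needs to be checked when the distinction matters, for example when choosing which tags to display to readers.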

Working with Tags

When implementing Geneea Media Tags, follow these rules as they apply to your use case:

1. Use Geneea Knowledge Base ID

Always rely on the Geneea Knowledge Base ID (gkbId) rather than the standard form (the name) of the tag:

The IDs distinguish between tags with the same name: Brian Cox, the physicist (G463581), the actor (G34975), the film director (G4963453), etc.
The IDs help when the tag's name changes: Prince Charles and Charles III both have the ID G43274.
The IDs link the same tag across languages: “United States,” “Vereinigte Staaten,” and “Estados Unidos” all have the same ID (G30).
The IDs link to the Geneea Knowledge Base, which provides additional information about the entity (its relations to other entities, IDs for other knowledge bases, etc.).
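
To illustrate why the ID is the safer key, here is a minimal sketch that indexes articles by ID instead of by name. It assumes each tag is a dictionary carrying its ID under a top-level gkbId key and its name under stdForm; both key placements, the article IDs, and the helper name index_by_gkb_id are assumptions made for this example, so check the Tag Object reference for the actual layout.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Set


def index_by_gkb_id(
    articles: Mapping[str, List[Mapping[str, Any]]]
) -> Dict[str, Set[str]]:
    """Map each knowledge base ID to the set of article IDs tagged with it.

    Assumes every tag dict exposes its ID under the key "gkbId"
    (an assumption for this sketch). Tags without an ID are skipped.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for article_id, tags in articles.items():
        for tag in tags:
            gkb_id = tag.get("gkbId")
            if gkb_id:
                index[gkb_id].add(article_id)
    return dict(index)


# Hypothetical input: same name, different people; different names, same person.
articles = {
    "a1": [{"gkbId": "G463581", "stdForm": "Brian Cox"}],    # the physicist
    "a2": [{"gkbId": "G34975", "stdForm": "Brian Cox"}],     # the actor
    "a3": [{"gkbId": "G43274", "stdForm": "Prince Charles"}],
    "a4": [{"gkbId": "G43274", "stdForm": "Charles III"}],
}
index = index_by_gkb_id(articles)
```

Grouping by name would merge the two Brian Cox articles and split the two articles about Charles III; grouping by ID does neither.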

Note that sometimes the ID might change. This happens when we discover duplicate entries in our knowledge base and choose one of them as the primary entry. This guide explains how our API communicates such changes and how to handle them.

2. Use Relevance

If you are passing tags to an analytics dashboard (e.g., to track coverage trends), do not rely on simple mention counting. Our relevance score (the feats.relevance field) is much more reliable: it also considers the position of each mention, indirect mentions, and other factors.

High Relevance: The article is about the topic.
Low Relevance: The topic is mentioned in passing.

See this guide for more information.
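
As a rough sketch of relevance-based filtering, the helper below keeps only tags at or above a threshold and sorts them by relevance. It assumes feats.relevance holds a number or a numeric string. The threshold and the sample values are hypothetical, since this section does not specify the scale of the score, so tune the cutoff against your own data.

```python
from typing import Any, List, Mapping


def relevance(tag: Mapping[str, Any]) -> float:
    """Read feats.relevance as a float; missing or malformed values count as 0."""
    raw = (tag.get("feats") or {}).get("relevance")
    try:
        return float(raw)
    except (TypeError, ValueError):
        return 0.0


def main_subjects(
    tags: List[Mapping[str, Any]], min_relevance: float
) -> List[Mapping[str, Any]]:
    """Return tags at or above the threshold, most relevant first."""
    kept = [tag for tag in tags if relevance(tag) >= min_relevance]
    return sorted(kept, key=relevance, reverse=True)


# Hypothetical tags; the relevance values are invented for illustration.
tags = [
    {"gkbId": "G30", "feats": {"relevance": "0.7"}},
    {"gkbId": "G847", "feats": {"relevance": "8.5"}},
    {"gkbId": "G43274", "feats": {}},
]
subjects = main_subjects(tags, min_relevance=5.0)
```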

3. External Mappings

Many of our semantic tags include external identifiers in the feats object. For example, named entities often contain a Wikidata ID (feats.wikidataId), and IPTC tags map to official IPTC concept URIs (feats.MediaTopicId). Use these fields to seamlessly connect Geneea's output with external knowledge graphs or standardized industry tools.
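
A minimal sketch of collecting these identifiers follows. It reads only the two feats keys named above (wikidataId and MediaTopicId); the helper name external_ids and the sample tag are illustrative, and other external keys may exist that this sketch does not cover.

```python
from typing import Any, Dict, Mapping

# The two external-ID keys mentioned in this section.
EXTERNAL_ID_KEYS = ("wikidataId", "MediaTopicId")


def external_ids(tag: Mapping[str, Any]) -> Dict[str, str]:
    """Return the external identifiers present in a tag's feats mapping."""
    feats = tag.get("feats") or {}
    return {key: feats[key] for key in EXTERNAL_ID_KEYS if key in feats}


# Hypothetical tag for tennis carrying a Wikidata ID and a relevance score.
tag = {"gkbId": "G847", "feats": {"wikidataId": "Q847", "relevance": "8.5"}}
print(external_ids(tag))
```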

4. Mentions

Tags based on entities are typically tied to specific words or phrases in the article. Knowing where a tag is mentioned lets you create hyperlinks, highlight relevant passages, or connect tags back to the original text for readers.

The API provides this information in two ways:

Simple Mention Reference: a lightweight option that locates an early occurrence of the tag and returns its surface text, paragraph, and character offset as feats keys. This is usually enough for linking or highlighting and keeps the response compact. See the Tag Simple Mention Features reference.
Full Mentions: a comprehensive list of every occurrence in the text, with token-level detail and normalized forms. Use this when you need to find all mentions or require precise token references. See the mentions field in the Tag reference.

For most use cases — such as adding a hyperlink on first mention — the simple mention reference is sufficient and much simpler to work with.
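
The hyperlink-on-first-mention case might look like the sketch below. It takes the surface text, paragraph index, and character offset as plain arguments, because the exact feats key names are documented in the Tag Simple Mention Features reference and are not repeated here. It also assumes that the offset is zero-based and counted from the start of the paragraph, which you should verify against that reference. The function name and the URL are hypothetical.

```python
import html
from typing import List


def link_mention(
    paragraphs: List[str],
    paragraph_index: int,
    offset: int,
    surface: str,
    href: str,
) -> List[str]:
    """Return HTML-escaped paragraphs with one mention wrapped in a link.

    Assumes `offset` is a zero-based character offset within the paragraph
    (an assumption for this sketch). If the text at that position does not
    match `surface`, the paragraphs are only escaped and no link is added.
    """
    escaped = [html.escape(p) for p in paragraphs]
    if not 0 <= paragraph_index < len(paragraphs):
        return escaped

    text = paragraphs[paragraph_index]
    end = offset + len(surface)
    if offset < 0 or text[offset:end] != surface:
        return escaped

    escaped[paragraph_index] = (
        html.escape(text[:offset])
        + f'<a href="{html.escape(href, quote=True)}">{html.escape(surface)}</a>'
        + html.escape(text[end:])
    )
    return escaped


paragraphs = ["Intro.", "Einstein was born in Ulm."]
linked = link_mention(paragraphs, 1, 0, "Einstein", "/tag/example")
```

Checking that the text at the offset matches the surface form guards against mismatches, for example when the article body was edited after it was tagged.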

See the Semantic Tagging guide for implementation details and examples.
