Semantic Tags
Semantic tags are labels assigned to articles to describe their content. By assigning unique identifiers to entities and topics, our tagger builds a machine-readable base for content discovery and analytics.
See:
- The Glossary for a general overview of semantic tags and their importance to media organizations
- The Tag Object in our API reference
- The Semantic Tagging Guide for step-by-step instructions on implementing semantic tagging
Components of Semantic Tags
The Geneea tagger analyzes content using specialized models. Depending on your configuration, the API returns tags from three main sources: entities, topic categories (taxonomies), and topics.
The table below provides a quick overview of these tag types, while the following subsections offer detailed explanations and links to further documentation.
| Source | Description | Examples |
|---|---|---|
| Entities | ||
| Named Entities | Specific people, locations, organizations, products, and events | Paul McCartney; France; Airbus; Game of Thrones; Christmas; Hurricane Katrina |
| General Entities | Broad concepts and keywords | global warming; income tax; smartphone |
| Topic Categories (Taxonomies) | ||
| IPTC Media Topics | Hierarchical taxonomy developed by IPTC; mainly used for analytics | Science and Technology; Economy, Business and Finance; Economic Trends and Indicators |
| IAB Content Taxonomy | Hierarchical taxonomy designed by IAB for marketing purposes | Technology & Computing; Personal Finance; Pet Insurance |
| IAB Brand Safety Categories | Tags flagging sensitive content to protect brand reputation and ad placement | Adult & Explicit Sexual Content; Terrorism; Debated Sensitive Social Issues |
| Topics | ||
| Geneea Topics | Reader-friendly topic classification | science and technology; economy; culture |
| Editorial Tags | Custom tags unique to your organization | Important people of our city; Mysterious murder stories |
Entities
Entities are specific elements and concepts identified within the article. They help you understand exactly who or what the content is about. For more details, see our Entities Guide. Our tagger detects two main types of entities:
- Named Entities: Specific people, locations, organizations, products, and events (e.g., Barack Obama, Paris, Apple Inc., World War II).
- General Entities: Broad concepts and keywords that represent meaningful ideas (e.g., electric vehicle, income tax, flu season).
Entities are typically explicitly mentioned in the text. For example, when an article is assigned an Albert Einstein tag, the name Einstein usually appears directly in the text. However, entities can also be assigned indirectly. This is especially common for:
- Derived Entities: Entities that are mentioned indirectly through other entities. For instance, an article mentioning French cities or regions is assigned the derived entity France as well.
- Geneea Topics: See the Entities vs. Topics section below.
Topic Categories (Taxonomies)
Topic categories are controlled vocabularies, often organized into stable hierarchical structures. Unlike free-form tagging, where different editors might use varying words for the same concept, categories provide standardized topics with stable identifiers. This is essential for reliable, large-scale content classification and metadata exchange.
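Geneea automatically assigns categories from two major industry-standard sources, IPTC and IAB. The IAB framework contributes both a content taxonomy and a set of brand safety categories: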
- IPTC Media Topics: Designed specifically for editorial and journalistic workflows. Maintained by the IPTC, this taxonomy contains over 1,200 topics across five levels. It reflects how newsrooms structure coverage (e.g., Politics and Government → Election → National Elections) and is ideal for editorial analytics, archive organization, and content discovery.
- IAB Content Taxonomy: Designed for the digital advertising ecosystem. It provides a shared vocabulary for publishers, ad platforms, and advertisers to align content for targeting and monetization.
- IAB Brand Safety Categories: As part of the IAB framework, we evaluate content for brand safety and suitability. Using semantic analysis instead of keyword-blocking, the tagger assesses context and risk across sensitive categories (like Adult Content or Violence). Keyword-based blocking is difficult to maintain and often produces false positives that inadvertently block legitimate content. Semantic analysis provides a more accurate, context-aware classification without these drawbacks.
A category captures the overall meaning of a text; it doesn't require a specific word to be present. For example, an article might be categorized under Tennis without ever explicitly using the word.
We can also provide custom categories mapped to your organization's specific needs.
Topics
While taxonomies rely on a strictly controlled hierarchy, Topics are a more open and flexible set of labels.
- Geneea Topics: These provide a broader, reader-friendly classification. While similar in purpose to IPTC or IAB categories, Geneea Topics focus on straightforward, easy-to-digest themes like science and technology, economy, or culture, making them ideal for frontend display and general reader navigation.
- Editorial Tags: These are custom, user-defined tags specific to your organization's unique content strategy (e.g., Important people of our city or Mysterious murder stories). Editorial tags are recall-centric. The system aims to offer a wide variety of relevant suggestions. We consider it a success if a journalist accepts one out of every five suggested editorial tags.
Entities vs. Topics
The boundary between entities and topics is fluid: the same concept, such as tennis, can appear as an entity, as a topic, or as both.
The API handles this overlap because entities and topics link to the same knowledge base,
so a concept carries the same Geneea Knowledge Base ID in either role (e.g., G847 for tennis).
Depending on how the concept appears in the text, you will see one of three behaviors in the API response:
- Entity and Topic: If the article is about tennis and explicitly mentions it, the API returns the tennis tag with a high relevance score, the feats.topic: geneea property, and the mentions included in the mentions array.
- Topic Only: If the article is clearly about tennis (e.g., discussing Wimbledon, racquets, and Grand Slams) but never actually uses the word "tennis", the API returns the same tag. It will have the same GKB ID, high relevance score, and feats.topic: geneea property, but the mentions array will be empty.
- Entity Only: If the article only mentions tennis in passing (e.g., an article about a city budget that funds a new tennis court among other things), it is treated strictly as an entity. The API returns the tag with a populated mentions array, but with a lower relevance score and without the feats.topic: geneea property, since the article itself is not about tennis.
This shared architecture ensures that your downstream systems can accurately group content by subject matter, regardless of whether the author explicitly named the concept, just described it, or merely mentioned it in passing.
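The three behaviors above can be told apart with a small check on each tag. The following is a minimal sketch, not the official client: it assumes each tag has already been parsed into a Python dict with a feats mapping and a mentions list, and the function name classify_tag, the returned labels, and the shape of the sample mentions are hypothetical.

```python
from typing import Any


def classify_tag(tag: dict[str, Any]) -> str:
    """Label a tag by the two signals described above: feats.topic and mentions."""
    feats = tag.get("feats") or {}
    is_topic = feats.get("topic") == "geneea"
    has_mentions = bool(tag.get("mentions"))
    if is_topic and has_mentions:
        return "entity_and_topic"
    if is_topic:
        return "topic_only"
    if has_mentions:
        return "entity_only"
    return "other"  # neither signal is present


# Hand-written sample tags (assumed shape); real responses carry more fields.
about_tennis = {"gkbId": "G847", "feats": {"topic": "geneea"}, "mentions": [{"text": "tennis"}]}
implied_tennis = {"gkbId": "G847", "feats": {"topic": "geneea"}, "mentions": []}
passing_tennis = {"gkbId": "G847", "feats": {}, "mentions": [{"text": "tennis court"}]}

labels = [classify_tag(t) for t in (about_tennis, implied_tennis, passing_tennis)]
```

Because all three sample tags share the ID G847, a downstream system can still group them together and use the label only to decide how prominently to surface each article.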
Working with Tags
When implementing Geneea Media Tags, there are a few rules you should follow depending on your use case:
1. Use Geneea Knowledge Base ID
Always rely on the Geneea Knowledge Base ID (gkbId) rather than the standard form (the name) of the tag:
- The IDs distinguish between tags with the same name: Brian Cox, the physicist (G463581), the actor (G34975), the film director (G4963453), etc.
- The IDs help when the tag's name changes: Prince Charles and Charles III both have the ID G43274.
- The IDs link the same tag across languages: “United States,” “Vereinigte Staaten,” and “Estados Unidos” all have the same ID (G30).
- The IDs link to the Geneea Knowledge Base, which provides additional information about the entity (its relations to other entities, IDs for other knowledge bases, etc.).
Note that sometimes the ID might change. This happens when we discover duplicate entries in our knowledge base and choose one of them as the primary entry. This guide explains how our API communicates such changes and how to handle them.
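As an illustration of keying on the ID rather than the name, the sketch below counts how many articles carry each tag. It assumes every article's tags are available as a list of dicts with a gkbId key; the stdForm key used for the name and the helper count_by_gkb_id are hypothetical, and the sample data is hand-written from the examples above.

```python
from collections import Counter


def count_by_gkb_id(articles):
    """Count articles per tag, keyed by gkbId instead of the tag's name."""
    counts = Counter()
    for tags in articles:
        # A set guards against the same ID being counted twice for one article.
        counts.update({tag["gkbId"] for tag in tags})
    return counts


# Each inner list holds the tags of one article (assumed shape).
articles = [
    [{"gkbId": "G30", "stdForm": "United States"}],
    [{"gkbId": "G30", "stdForm": "Vereinigte Staaten"}],
    [{"gkbId": "G43274", "stdForm": "Prince Charles"}],
    [{"gkbId": "G43274", "stdForm": "Charles III"}, {"gkbId": "G30", "stdForm": "Estados Unidos"}],
]

counts = count_by_gkb_id(articles)
```

Counting by name would split these articles across five different labels; counting by ID yields two subjects.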
2. Use Relevance
If you are passing tags to an analytics dashboard (e.g., to track coverage trends), do not rely on simple mention counting.
Our relevance score (the feats.relevance field) is much more reliable:
it also considers the position of each mention, indirect mentions, and other factors.
- High Relevance: The article is about the topic.
- Low Relevance: The topic is mentioned in passing.
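A dashboard feed might apply this by keeping only tags above a cutoff. This is a minimal sketch under stated assumptions: the relevance values, their scale, and the 0.5 threshold are hypothetical and should be tuned against your own data, and the helpers relevance and main_subjects are illustrative names rather than part of the API.

```python
def relevance(tag):
    """Read feats.relevance as a float, defaulting to 0.0 when it is missing."""
    return float((tag.get("feats") or {}).get("relevance", 0.0))


def main_subjects(tags, min_relevance=0.5):
    """Keep tags at or above the cutoff, ordered from most to least relevant."""
    kept = [t for t in tags if relevance(t) >= min_relevance]
    return sorted(kept, key=relevance, reverse=True)


# Hand-written sample tags with made-up relevance values.
tags = [
    {"gkbId": "G847", "feats": {"relevance": 0.9}},
    {"gkbId": "G30", "feats": {"relevance": 0.1}},
    {"gkbId": "G43274", "feats": {"relevance": 0.6}},
]

subject_ids = [t["gkbId"] for t in main_subjects(tags)]
```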
3. External Mappings
Many of our semantic tags include external identifiers in the feats object.
For example, named entities often contain a Wikidata ID, and IPTC tags map to official IPTC concept URIs.
Use these fields to seamlessly connect Geneea's output with external knowledge graphs or standardized industry tools.
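One way to use these identifiers is to build a lookup from an external ID back to the Geneea ID. The sketch below assumes external identifiers sit directly in the feats mapping; the key name wikidataId is a placeholder, not a confirmed field name, so check the Tag Object reference for the actual keys. The helper index_by_external_id is likewise hypothetical.

```python
WIKIDATA_KEY = "wikidataId"  # placeholder key name; confirm in the API reference


def index_by_external_id(tags, key):
    """Map external ID -> gkbId for the tags that carry an ID under feats[key]."""
    index = {}
    for tag in tags:
        external = (tag.get("feats") or {}).get(key)
        if external:  # skip tags that have no mapping for this key
            index[external] = tag["gkbId"]
    return index


# Hand-written sample tags (assumed shape).
tags = [
    {"gkbId": "G30", "feats": {WIKIDATA_KEY: "Q30"}},
    {"gkbId": "G847", "feats": {}},
]

wikidata_index = index_by_external_id(tags, WIKIDATA_KEY)
```

Passing the key as a parameter keeps the same helper usable for other mappings, such as IPTC concept URIs, once the real field names are known.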