Semantic Tagging

The Media API can perform semantic tagging of articles. Semantic tags are entities, keywords or concepts relevant to the article. We rank and standardize them based on their purpose. For a non-technical overview, see this page and this case study.

Below, we discuss various technical topics related to obtaining semantic tags:

  • First steps: basic common setup for calling the API required by the rest of this article

  • Basic tagging: a simple call to the API to obtain semantic tags

  • Other features: depending on your configuration, entities, sentiment and other information can be returned as well

  • Paragraphs: handle lead and multiple text paragraphs

  • Topic categories: improve the analysis by specifying topic or sections of the article

  • Presentation language: return tags and entities in a particular language

  • Knowledge base properties: information about tags and entities drawn from the knowledge base as part of the call

For a full description of the API, see the reference guide.

First Steps

To use the API, you need a valid API key with appropriate authorizations. If you do not have one, please get in touch with us here.

Note that we do not provide SDKs for the API yet, but our G3 SDKs can be used to perform NLP analysis.

Common Basic Code

The examples below use curl, which needs no special setup. In each command, replace <YOUR_API_KEY> with your API key.
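If you prefer Python to curl, the common setup could look roughly like the sketch below. It uses only the standard library; the URL and header names are copied from the curl examples in this article, while the function names build_analyze_request and analyze are hypothetical helpers, not part of any official SDK.

```python
import json
import urllib.request

# URL and header names are taken from the curl examples in this article.
ANALYZE_URL = "http://media-api.geneea.com/v2/nlp/analyze"


def build_analyze_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the analyze endpoint."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "X-API-KEY": api_key,
        "accept": "*/*",
        "content-type": "application/json",
    }
    return urllib.request.Request(ANALYZE_URL, data=body, headers=headers, method="POST")


def analyze(api_key: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (performs a network call)."""
    request = build_analyze_request(api_key, payload)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect or log the request without calling the service.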

Tags – Basic analysis

To perform a basic analysis of a document to obtain tags (keywords), use the following code:

curl -X POST -H 'X-API-KEY: <YOUR_API_KEY>' -H 'accept: */*' -H 'content-type: application/json' 'http://media-api.geneea.com/v2/nlp/analyze' -d '{
    "id": "1234",
    "title": "Emmanuel Macron in Germany.",
    "text": "Mr. Macron visited a trade show in Munich."
}'

The above code produces a result similar to the following. The result might also contain relations, sentiment, etc., depending on your configuration (see below).

{
    "version": "3.3.0",
    "id": "1234",
    "language": {"detected": "en"},
    "tags": [
        {"id": "t1", "gkbId": "G3052772", "stdForm": "Emmanuel Macron", "type": "media", "relevance": 96.0, "feats": {"wikidataId": "Q3052772", "gkbEntityType": "person"}},
        {"id": "t2", "gkbId": "G183", "stdForm": "Germany", "type": "media", "relevance": 94.0, "feats": {"wikidataId": "Q183", "gkbEntityType": "location"}},
        {"id": "t3", "gkbId": "G1726", "stdForm": "Munich", "type": "media", "relevance": 66.0, "feats": {"wikidataId": "Q1726", "gkbEntityType": "location"}},
        {"id": "t4", "gkbId": "IPTC-11000000", "stdForm": "politics", "type": "media-topic", "relevance": 68.51, "feats": {"MediaTopicId": "11000000", "wikidataId": "Q7163", "gkbEntityType": "general"}}
    ],
    "usedChars": 100,
    "metadata": {"referenceKey": "241014-164726-9bdaf485"}
}

In this case, we see two types of tags:

  • entity-based tags ("type": "media"): this is a selection of the most important entities, names (e.g., organizations, cities) and keywords (see here).

  • IPTC media topics ("type": "media-topic"): an industry-standard taxonomy used for categorizing articles by content. The current version consists of over 1200 categories organized into a hierarchy of up to 5 levels. The above result contains politics; other examples are sport, basketball, music, and classical music. For more detail, see this article.

Each tag has

  • A unique identifier (e.g., "gkbId": "G183") that links it to our knowledge base.

  • A standard name in one of the supported languages (e.g., "stdForm": "Germany"); see Presentation Language below.

  • A relevance score between 0 and 100 (e.g., "relevance": 94.0), which indicates its importance in relation to both the article and the customer’s needs. This is distinct from entity relevance, which only considers the article itself when determining importance.

  • Third-party identifiers, such as Wikidata or IPTC media topics

  • The type of the corresponding knowledge base item (person, organization, location, event, product, general)

  • An internal identifier (e.g., "id": "t2") used for cross-referencing in more complex configurations.
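The tag fields described above can be read with a few lines of client code. The following is a minimal sketch that assumes the response has already been decoded into a Python dict; the sample data is abbreviated from the result above, and tags_by_type is a hypothetical helper name.

```python
# Tags abbreviated from the sample response shown above.
response = {
    "tags": [
        {"id": "t3", "gkbId": "G1726", "stdForm": "Munich", "type": "media",
         "relevance": 66.0, "feats": {"wikidataId": "Q1726", "gkbEntityType": "location"}},
        {"id": "t1", "gkbId": "G3052772", "stdForm": "Emmanuel Macron", "type": "media",
         "relevance": 96.0, "feats": {"wikidataId": "Q3052772", "gkbEntityType": "person"}},
        {"id": "t4", "gkbId": "IPTC-11000000", "stdForm": "politics", "type": "media-topic",
         "relevance": 68.51, "feats": {"MediaTopicId": "11000000", "gkbEntityType": "general"}},
    ]
}


def tags_by_type(response: dict) -> dict:
    """Group tags by their type, most relevant first within each group."""
    grouped = {}
    ordered = sorted(response.get("tags", []), key=lambda t: t["relevance"], reverse=True)
    for tag in ordered:
        # feats may be missing some keys, so read them defensively.
        kb_type = tag.get("feats", {}).get("gkbEntityType")
        grouped.setdefault(tag["type"], []).append((tag["stdForm"], tag["relevance"], kb_type))
    return grouped


for tag_type, tags in tags_by_type(response).items():
    print(tag_type, tags)
```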

Entities, sentiment, etc.

The exact set of returned features depends on your account plan and configuration. Above, the result contains just tags, which is the most common situation. However, other information can be included as well (entities, relations, document sentiment), for example:

{
    "version": "3.3.0",
    "id": "1234",
    "language": {"detected": "en"},
    "entities": [
        {"id": "e0", "gkbId": "G57305", "stdForm": "trade fair", "type": "general", "feats": {"relevance": "11", "ranking": "11"}},
        {"id": "e1", "gkbId": "G183", "stdForm": "Germany", "type": "location", "feats": {"derivedBy": "country", "relevance": "94", "ranking": "94"}},
        {"id": "e2", "gkbId": "G1726", "stdForm": "Munich", "type": "location", "feats": {"derivedBy": "city", "relevance": "66", "ranking": "66"}},
        {"id": "e3", "gkbId": "G3052772", "stdForm": "Emmanuel Macron", "type": "person", "feats": {"relevance": "96", "ranking": "96"}},
        {"id": "e4", "gkbId": "G980", "stdForm": "Bavaria", "type": "location", "feats": {"derivedBy": "region", "derivedOnly": "true", "relevance": "42", "ranking": "42"}},
        {"id": "e5", "gkbId": "G10562", "stdForm": "Upper Bavaria", "type": "location", "feats": {"derivedBy": "district", "derivedOnly": "true", "relevance": "41", "ranking": "41"}}
    ],
    "tags": [
        {"id": "t1", "gkbId": "G3052772", "stdForm": "Emmanuel Macron", "type": "media", "relevance": 96.0, "feats": {"wikidataId": "Q3052772", "gkbEntityType": "person"}},
        {"id": "t2", "gkbId": "G183", "stdForm": "Germany", "type": "media", "relevance": 94.0, "feats": {"wikidataId": "Q183", "gkbEntityType": "location"}},
        {"id": "t3", "gkbId": "G1726", "stdForm": "Munich", "type": "media", "relevance": 66.0, "feats": {"wikidataId": "Q1726", "gkbEntityType": "location"}},
        {"id": "t4", "gkbId": "IPTC-11000000", "stdForm": "politics", "type": "media-topic", "relevance": 68.51, "feats": {"MediaTopicId": "11000000", "wikidataId": "Q7163", "gkbEntityType": "general"}}
    ]
    "usedChars": 100,
    "metadata": {"referenceKey": "241014-164726-9bdaf485"}
}

In addition to tags, we now also receive entities. Here are some important points to note:

  • The media tags are a subset of entities. In this setup, the relevance of tags is equal to the relevance of the corresponding entities, meaning we can view these tags as the most relevant entities. However, this is not always the case. Entity relevance is determined solely by the content of the article, while tag relevance may be influenced by other factors. Additionally, while we can view tags as the top N entities, we can adjust the relevance of specific types of tags or even individual tags based on their context. For instance, locations may be more relevant in a travel section than in a sports section.

  • Some entities are classified as derived entities. For example, the state of Bavaria and the region of Upper Bavaria are not explicitly mentioned, but they are included because the text references Munich. The entity of Germany combines both explicit and implicit mentions, as it is directly stated but also referenced indirectly through Munich.

  • Certain information is encoded as features (e.g., relevance, nature of derived entities). These features are represented as key-value pairs, where both the keys and values are always strings. If a feature has a different semantic type (e.g., relevance is a number), it must be converted.
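Because feature values arrive as strings, client code typically converts them before use. The sketch below shows one way to do this for the features seen in this article; which features are numeric or boolean is an assumption based on the sample response, and typed_feats is a hypothetical helper name.

```python
# Entities abbreviated from the sample response shown above.
entities = [
    {"stdForm": "Germany",
     "feats": {"derivedBy": "country", "relevance": "94", "ranking": "94"}},
    {"stdForm": "Emmanuel Macron",
     "feats": {"relevance": "96", "ranking": "96"}},
    {"stdForm": "Bavaria",
     "feats": {"derivedBy": "region", "derivedOnly": "true", "relevance": "42", "ranking": "42"}},
]

# Assumption: these features carry numbers and booleans encoded as strings.
NUMERIC_FEATS = ("relevance", "ranking")
BOOLEAN_FEATS = ("derivedOnly",)


def typed_feats(feats: dict) -> dict:
    """Return a copy of feats with known features converted from strings."""
    converted = dict(feats)
    for key in NUMERIC_FEATS:
        if key in converted:
            converted[key] = float(converted[key])
    for key in BOOLEAN_FEATS:
        if key in converted:
            converted[key] = converted[key] == "true"
    return converted


for entity in entities:
    feats = typed_feats(entity["feats"])
    kind = "derived only" if feats.get("derivedOnly", False) else "mentioned"
    print(f'{entity["stdForm"]}: relevance={feats["relevance"]}, {kind}')
```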

Paragraphs

The API and the SDKs support easy specification of the title and body of an article. To specify other types of paragraphs (e.g., the lead paragraph) or multiple text paragraphs, it is necessary to use the paraSpecs field. Currently, the standard public API distinguishes title, abstract (lead) and text (body) paragraph types.

curl -X POST -H 'X-API-KEY: <YOUR_API_KEY>' -H 'accept: */*' -H 'content-type: application/json' 'http://media-api.geneea.com/v2/nlp/analyze' -d '{
    "id": "1234",
    "paraSpecs": [
        {"type": "title", "text": "Macron in Germany."},
        {"type": "abstract", "text": "Emmanuel Macron is visiting Germany again."},
        {"type": "text", "text": "Mr. Macron visited a trade show in Munich."}
    ]
}'
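When the article is stored as separate title, lead and body fields, the paraSpecs list can be assembled programmatically. The following sketch assumes only the three paragraph types named above; build_para_specs is a hypothetical helper name.

```python
import json


def build_para_specs(title: str, lead: str, body_paragraphs: list) -> list:
    """Assemble a paraSpecs list, skipping empty parts."""
    specs = []
    if title.strip():
        specs.append({"type": "title", "text": title})
    if lead.strip():
        specs.append({"type": "abstract", "text": lead})
    # Each body paragraph becomes its own "text" entry.
    specs.extend({"type": "text", "text": p} for p in body_paragraphs if p.strip())
    return specs


payload = {
    "id": "1234",
    "paraSpecs": build_para_specs(
        "Macron in Germany.",
        "Emmanuel Macron is visiting Germany again.",
        ["Mr. Macron visited a trade show in Munich."],
    ),
}
print(json.dumps(payload, indent=4))
```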

Topic categories (sections)

Often, the topic of the article is known before the analysis. For example, the article is published within a certain section of the website (e.g., sport, hobby). Providing this information is optional, because automatic detection of the article topic always runs as part of the analysis. However, when available, it further improves the quality of the results. We support two types of topic categories:

  • standard IPTC media topics, and

  • custom categories/sections of the publisher. The custom categories have to be configured on our side to have any effect.

These two types can even be combined, as you can see in the example below:

curl -X POST -H 'X-API-KEY: <YOUR_API_KEY>' -H 'accept: */*' -H 'content-type: application/json' 'http://media-api.geneea.com/v2/nlp/analyze' -d '{
    "id": "1234",
    "title": "Emmanuel Macron in Germany.",
    "text": "Mr. Macron visited a trade show in Munich.",
    "presentationLanguage": "fr",
    "categories": [{"taxonomy": "MediaTopic", "code": "11000000"}, {"taxonomy": "Custom", "code": "politics"} ]
}'
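The categories field can also be built in client code from the section information you already have. This is a minimal sketch: the taxonomy names MediaTopic and Custom come from the example above, while build_categories is a hypothetical helper name.

```python
import json


def build_categories(media_topic_codes=(), custom_codes=()) -> list:
    """Combine IPTC media topic codes and custom section codes into one list."""
    categories = [{"taxonomy": "MediaTopic", "code": code} for code in media_topic_codes]
    categories += [{"taxonomy": "Custom", "code": code} for code in custom_codes]
    return categories


payload = {
    "id": "1234",
    "title": "Emmanuel Macron in Germany.",
    "text": "Mr. Macron visited a trade show in Munich.",
    "categories": build_categories(media_topic_codes=["11000000"], custom_codes=["politics"]),
}
print(json.dumps(payload, indent=4))
```

Remember that custom codes only have an effect once they are configured on the service side.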

Presentation Language

Above, the entities and tags were reported in the language of the document, i.e. English. However, we can request them in other languages as well (currently, Czech, Dutch, English, French, German, Polish, Portuguese, Slovak, and Spanish are supported) using the parameter presentationLanguage with the ISO code of the desired language:

curl -X POST -H 'X-API-KEY: <YOUR_API_KEY>' -H 'accept: */*' -H 'content-type: application/json' 'http://media-api.geneea.com/v2/nlp/analyze' -d '{
    "id": "1234",
    "title": "Emmanuel Macron in Germany.",
    "text": "Mr. Macron visited a trade show in Munich.",
    "presentationLanguage": "fr"
}'

This produces the following result (see the Analysis reference page for an explanation; note that we have omitted the relations field for simplicity):

{
    "version": "3.3.0",
    "id": "1234",
    "language": {"detected": "en"},
    "entities": [
        {"id": "e0", "gkbId": "G57305", "stdForm": "salon", "type": "general", "feats": {"relevance": "11", "ranking": "11"}},
        {"id": "e1", "gkbId": "G183", "stdForm": "Allemagne", "type": "location", "feats": {"derivedBy": "country", "relevance": "94", "ranking": "94"}},
        {"id": "e2", "gkbId": "G1726", "stdForm": "Munich", "type": "location", "feats": {"derivedBy": "city", "relevance": "66", "ranking": "66"}},
        {"id": "e3", "gkbId": "G3052772", "stdForm": "Emmanuel Macron", "type": "person", "feats": {"relevance": "96", "ranking": "96"}},
        {"id": "e4", "gkbId": "G980", "stdForm": "Bavière", "type": "location", "feats": {"derivedBy": "region", "derivedOnly": "true", "relevance": "42", "ranking": "42"}},
        {"id": "e5", "gkbId": "G10562", "stdForm": "Haute-Bavière", "type": "location", "feats": {"derivedBy": "district", "derivedOnly": "true", "relevance": "41", "ranking": "41"}}
    ],
    "tags": [
        {"id": "t1", "gkbId": "G3052772", "stdForm": "Emmanuel Macron", "type": "media", "relevance": 96.0, "feats": {"wikidataId": "Q3052772", "gkbEntityType": "person"}},
        {"id": "t2", "gkbId": "G183", "stdForm": "Allemagne", "type": "media", "relevance": 94.0, "feats": {"wikidataId": "Q183", "gkbEntityType": "location"}},
        {"id": "t3", "gkbId": "G1726", "stdForm": "Munich", "type": "media", "relevance": 66.0, "feats": {"wikidataId": "Q1726", "gkbEntityType": "location"}},
        {"id": "t4", "gkbId": "IPTC-11000000", "stdForm": "Politique", "type": "media-topic", "relevance": 68.51, "feats": {"MediaTopicId": "11000000", "wikidataId": "Q7163", "gkbEntityType": "general"}}
    ]
    "usedChars": 100,
    "metadata": {"referenceKey": "241014-164726-ab2eaf07"}
}

If you need the entities and tags translated into multiple languages, see Multiple Presentation Languages.
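A client may want to validate the requested language before calling the API. The sketch below maps the languages listed above to their two-letter ISO 639-1 codes; that the API expects exactly these codes is an assumption based on the "fr" example, and with_presentation_language is a hypothetical helper name.

```python
# Languages listed in this article, with their ISO 639-1 codes.
PRESENTATION_LANGUAGES = {
    "czech": "cs", "dutch": "nl", "english": "en", "french": "fr", "german": "de",
    "polish": "pl", "portuguese": "pt", "slovak": "sk", "spanish": "es",
}


def with_presentation_language(payload: dict, language: str) -> dict:
    """Return a copy of payload with presentationLanguage set.

    Accepts either a language name ("French") or a code ("fr").
    """
    key = language.strip().lower()
    code = PRESENTATION_LANGUAGES.get(key, key)
    if code not in PRESENTATION_LANGUAGES.values():
        raise ValueError(f"Unsupported presentation language: {language!r}")
    return {**payload, "presentationLanguage": code}


request_body = with_presentation_language(
    {"id": "1234", "text": "Mr. Macron visited a trade show in Munich."}, "French"
)
print(request_body["presentationLanguage"])
```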

Knowledge base properties

Knowledge base properties can be returned along with tags and entities. The exact set of features is configurable; the example below returns the description for each tag/entity.

A knowledge base (GKB) property has three kinds of attributes:

  • name: a language-independent identifier. There might be multiple properties with the same name (e.g., multiple occupations).

  • label: a human-readable label of the property in the presentation language of the analysis

  • boolValue/floatValue/intValue/strValue: the value of the property. Exactly one of these attributes is non-empty.

If a given property does not exist for a particular tag or entity, it is not returned at all.

curl -X POST -H 'X-API-KEY: <YOUR_API_KEY>' -H 'accept: */*' -H 'content-type: application/json' 'http://media-api.geneea.com/v2/nlp/analyze' -d '{
    "id": "1234",
    "title": "Emmanuel Macron in Germany.",
    "text": "Mr. Macron visited a trade show in Munich."
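}'

With description properties enabled in the configuration, the result looks similar to the following:

{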
  "version": "3.3.0",
  "id": "1234",
  "language": { "detected": "en" },
  "tags": [
    { "id": "t0", "gkbId": "G3052772", "stdForm": "Emmanuel Macron", "type": "media", "relevance": 22.605,
        "feats": { "wikidataId": "Q3052772" },
        "gkbProperties": [{"name": "description", "label":  "description", "strValue": "President of France and Co-Prince of Andorra since 2017"}]
    },
    { "id": "t1", "gkbId": "G183", "stdForm": "Germany", "type": "media", "relevance": 18.365,
        "feats": { "wikidataId": "Q183" },
        "gkbProperties": [{"name": "description", "label":  "description", "strValue": "country in Central Europe"}]
    },
    { "id": "t2", "gkbId": "G1726", "stdForm": "Munich", "type": "media", "relevance": 7.57,
        "feats": { "wikidataId": "Q1726" },
        "gkbProperties": [{"name": "description", "label":  "description", "strValue": "capital and most populous city of Bavaria, Germany"}]
    }
  ],
  "usedChars": 100,
  "metadata": {"referenceKey": "311441-120020-a24f0281"}
}
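Since the value of a property can sit in any one of four attributes, and a name can repeat, client code usually normalizes properties before use. The following sketch assumes that unused value attributes are either absent, null or empty; property_value and properties_by_name are hypothetical helper names.

```python
VALUE_KEYS = ("boolValue", "floatValue", "intValue", "strValue")


def property_value(prop: dict):
    """Return the single non-empty value of a GKB property, or None."""
    for key in VALUE_KEYS:
        # False and 0 are legitimate values, so only skip None and "".
        if prop.get(key) not in (None, ""):
            return prop[key]
    return None


def properties_by_name(item: dict) -> dict:
    """Map property names to lists of values (names may repeat)."""
    grouped = {}
    for prop in item.get("gkbProperties", []):
        grouped.setdefault(prop["name"], []).append(property_value(prop))
    return grouped


# Tag taken from the sample response above.
tag = {
    "id": "t1", "gkbId": "G183", "stdForm": "Germany", "type": "media", "relevance": 18.365,
    "feats": {"wikidataId": "Q183"},
    "gkbProperties": [
        {"name": "description", "label": "description", "strValue": "country in Central Europe"}
    ],
}
print(properties_by_name(tag))
```

Items without a given property simply lack the corresponding key, which matches the rule that missing properties are not returned at all.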