Archive Analysis

In addition to analyzing articles through our Media REST API, we also support "offline" analysis of article collections.

This is typically part of the onboarding process, and it serves two main purposes:

  1. To analyze a large number of historical articles
  2. To fine-tune our models for improved performance on your content

Compared to standard API-based analysis, offline processing lets our systems see the entire dataset at once, so we can detect patterns and relationships that are hard to observe when articles are processed individually. It also allows our data analysts to fine-tune NLP models, enhance the knowledge base, and more.

Below, you'll find instructions for how to format and share your archive or article sample. Please zip or gzip the resulting file and share it using a mutually agreed method (e.g., email, S3, GDrive).

You may use a password to secure the zip file if needed.

File Format

We accept the following formats, listed in order of preference:

  • JSON Lines (jsonl): Each article is a separate JSON object, one per line, encoded in UTF-8.
  • Escenic Content Engine imports (Stibo DX CUE format). See: http://xmlns.escenic.com/2009/import
  • JSON:
    • A single file containing an array of JSON objects (one per article), UTF-8 encoded
    • OR one JSON file per article, each UTF-8 encoded
  • CSV:
    • Must include a header row
    • UTF-8 encoding required
    • Follow RFC 4180
      • Enclose fields with line breaks, double quotes, or commas in double quotes
      • Escape double quotes by doubling them
    • Supports both LF and CRLF as line breaks and record separators
  • MS Excel (.xlsx, OOXML format):
    • First row must contain headers
    • Limitations: Max 1 million rows per file; max 32 thousand characters per cell
  • XML: Only if none of the formats above are feasible.
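
As an illustration of the preferred format, the following minimal Python sketch writes articles as UTF-8 JSON Lines and gzips the result in one step. The function names and the file name are hypothetical, and the sketch assumes each article is already available as a plain dictionary.

```python
import gzip
import json


def write_jsonl_gz(articles, path):
    """Write an iterable of article dicts to a gzip-compressed JSON Lines file."""
    # "wt" opens the gzip stream in text mode; the encoding is set explicitly.
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as fh:
        for article in articles:
            # json.dumps escapes newlines inside strings, so every article
            # stays on exactly one line. ensure_ascii=False keeps non-ASCII
            # characters as readable UTF-8 instead of \uXXXX escapes.
            fh.write(json.dumps(article, ensure_ascii=False) + "\n")


def read_jsonl_gz(path):
    """Read the file back, e.g. to spot-check it before sharing."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Calling `write_jsonl_gz(articles, "archive.jsonl.gz")` produces a single file ready to share. Note that gzip has no password protection; if a password is needed, a zip archive created with a separate tool is the likelier route.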

Content Format

Use the following field names as JSON keys (for .jsonl/.json), header names (for CSV/Excel), or tag names (for XML):

  • id – Article ID

  • date – Publication date in ISO 8601, e.g., 2022-08-24. Time can be included, e.g., 2022-08-24T12:13:24+01:00

  • title – Article title in plain text

  • lead – Article lead (perex) in plain text

  • body – Article body in HTML or plain text. It should not include ads, recommendation links, or similar page elements. Separate paragraphs consistently: either with an empty line (i.e., two consecutive newlines) or with a single newline.

  • sections – List of section names where the article appeared (e.g., politics, sports, business). The values are publisher-specific. In CSV and Excel files, use | as the separator.

  • url – URL where the article was published

  • tags – Article tags or keywords. Two formats are supported:

    • List of strings:

      • JSONL/JSON: use an array, e.g., "tags": ["travel", "Hawaii"].
      • XML: use a tag element for each item, e.g., <tags><tag>travel</tag><tag>Hawaii</tag></tags>.
      • CSV/Excel: use | as the separator, e.g., travel|Hawaii.
    • List of objects, each with:

      • id – tag ID (optional)
      • text – tag label (required)
      • type – tag type (optional)

      Use only in JSONL/JSON or XML formats. Example: {"id": "17", "text": "travel", "type": "SEO"} or {"text": "travel", "type": "SEO"}.

All fields are optional, but at least one of title, lead, or body must be provided. You may also include any additional metadata in extra fields/columns.
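
Before exporting, it can help to check each article against the rules above. The sketch below is one possible pre-flight check, not an official validator: `check_article` and `TEXT_FIELDS` are hypothetical names, and `datetime.fromisoformat` only covers a subset of ISO 8601 (for example, a trailing `Z` is rejected before Python 3.11), so treat a date warning as a hint rather than a verdict.

```python
from datetime import datetime

TEXT_FIELDS = ("title", "lead", "body")


def check_article(article):
    """Return a list of problems found in one article dict (empty if none)."""
    problems = []

    # At least one of title, lead, or body must carry some text.
    if not any(
        isinstance(article.get(f), str) and article[f].strip()
        for f in TEXT_FIELDS
    ):
        problems.append("at least one of title, lead, or body must be provided")

    # The date is optional, but if present it should parse as ISO 8601.
    value = article.get("date")
    if value is not None:
        try:
            datetime.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"date is not ISO 8601: {value!r}")

    # Tags are either plain strings or objects with a required "text" key.
    for tag in article.get("tags") or []:
        is_label = isinstance(tag, str)
        is_object = isinstance(tag, dict) and isinstance(tag.get("text"), str)
        if not (is_label or is_object):
            problems.append(f"tag needs a string or an object with 'text': {tag!r}")

    return problems
```

An article passes when the returned list is empty; extra metadata fields are ignored by the check, in line with the rule that additional fields are allowed.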

Example

Here is an example of a .jsonl file containing two articles:

{"id": "a1", "date": "2022-08-24", "title": "A new continent discovered", "lead": "Scientists discovered a ....", "body": "For millions of years, ...", "sections": ["news"], "url": "https://news-today.com/news/article123", "tags": ["new continents", "University of Cambridge"]}
{"id": "a2", "date": "2022-08-25", "title": "A new continent disappeared", "lead": "Scientists are puzzled.", "body": "Yesterday, the continent ...", "sections": ["news"], "url": "https://news-today.com/news/article222", "tags": ["new continents", "University of Oxford"]}
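
If CSV is the only practical option, the same two articles could be flattened roughly as follows. This is a sketch under stated assumptions: `articles_to_csv`, `FIELDS`, and `ARTICLES` are illustrative names, tags are assumed to be plain strings (tag objects are not supported in CSV), and extra metadata columns are dropped for brevity. Python's `csv` module with default settings quotes fields containing commas, double quotes, or line breaks and doubles embedded quotes, in line with RFC 4180.

```python
import csv
import io

FIELDS = ["id", "date", "title", "lead", "body", "sections", "url", "tags"]

ARTICLES = [
    {"id": "a1", "date": "2022-08-24", "title": "A new continent discovered",
     "sections": ["news"], "url": "https://news-today.com/news/article123",
     "tags": ["new continents", "University of Cambridge"]},
    {"id": "a2", "date": "2022-08-25", "title": "A new continent disappeared",
     "sections": ["news"], "url": "https://news-today.com/news/article222",
     "tags": ["new continents", "University of Oxford"]},
]


def articles_to_csv(articles):
    """Render article dicts as CSV text with a header row."""
    buf = io.StringIO(newline="")
    # extrasaction="ignore" silently drops keys that are not in FIELDS;
    # missing keys are written as empty cells.
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for article in articles:
        row = dict(article)
        # List-valued fields are joined with "|" as the separator.
        for key in ("sections", "tags"):
            row[key] = "|".join(article.get(key) or [])
        writer.writerow(row)
    return buf.getvalue()
```

To write a file, encode the returned text as UTF-8, for example with `open("archive.csv", "w", encoding="utf-8", newline="")`.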