Archive Analysis

In addition to analyzing articles through our Media REST API, we also support analysis of a collection of articles “offline”. Typically, this is a part of the onboarding process, and the purpose is twofold: (i) to analyze a large number of historical articles, and (ii) to tune our models to provide even better results for your articles.

In comparison with the standard API-based analysis, this allows our systems to see all the articles at the same time and learn from various relationships that are hard to see when articles are analyzed separately. Also, our data analysts can tune our NLP models, improve the knowledge base, etc.

Below, we explain how to format an export of your archive or a sample of articles. Compress the resulting file with zip or gzip and share it with us in a mutually agreed manner (email, S3, Google Drive, etc.). You may protect the zip file with a password.

File format

We accept the following formats (in the order of preference):

  • JSON Lines (jsonl). Each article is saved as a separate json object, one such object per line; encoded in UTF-8.

  • JSON. An array of json objects, each storing a single article; encoded in UTF-8.

  • CSV:

    • Use a header; use UTF-8.

    • Follow the RFC 4180 specification, especially:

      • enclose all fields containing line breaks, double quotes and/or commas in double quotes,

      • escape a double quote inside a quoted field by preceding it with another double quote.

      We support both CRLF and LF characters as line breaks and record separators.

  • MS Excel in the OOXML format (xlsx):

    • The first row should contain the header.

    • Note that this format has size limitations: a worksheet cannot contain more than 1,048,576 rows, and a cell cannot store more than 32,767 characters.

  • XML: Only if none of the other formats is feasible for you.
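
The CSV rules above can be met with a standard CSV library rather than hand-written quoting. The following is a minimal sketch in Python, assuming the articles are already available as dictionaries; the function name articles_to_csv and the sample column names are illustrative only and not part of our API.

```python
import csv
import io


def articles_to_csv(articles, fieldnames):
    """Serialize article dicts to CSV text using RFC 4180 style quoting."""
    buffer = io.StringIO(newline="")  # no newline translation
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        lineterminator="\r\n",      # CRLF record separator
        extrasaction="ignore",      # skip keys that are not in the header
    )
    writer.writeheader()
    for article in articles:
        writer.writerow(article)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{"id": "a1", "title": 'He said "hi", twice', "body": "line1\nline2"}]
    print(articles_to_csv(sample, ["id", "title", "body"]))
```

When writing to a file instead of a string, open it with encoding="utf-8" and newline="" so that the line breaks produced by the writer are not altered.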

Content format

Use the following strings as the keys in jsonl/json, as header values in CSV and Excel and as tag names in XML.

  • id - the article id

  • date - the publishing date in the ISO 8601 format, e.g., 2022-08-24. The date might be followed by time, e.g., 2022-08-24T12:13:24+01:00.

  • title - the title of the article in plain text

  • lead - the lead (perex) of the article in plain text

  • body - the body of the article in HTML or plain text. Ideally, it should contain no advertisements, “you might also be interested in” links, etc. In plain text, either separate paragraphs with an empty line (i.e., two new lines), or use a new line only between paragraphs and never inside one.

  • sections - an array of the names of the sections the article appeared in. The values are publisher-specific; typical values are politics, sports, business, etc. In CSV and Excel files, use | to separate multiple section names within the cell.

  • url - a URL at which the article was published

All of the fields are optional, but you have to provide at least one of title, lead, body. You can add any additional article metadata into additional keys/columns.
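
The field rules above can be checked before an export is shared. The following is a minimal sketch in Python; the helper names check_article and sections_to_cell are hypothetical, and datetime.fromisoformat is used as an approximation of ISO 8601 (it accepts both example dates shown above, but not every ISO 8601 variant on older Python versions).

```python
from datetime import datetime

TEXT_FIELDS = ("title", "lead", "body")


def check_article(article):
    """Return a list of problems found in one article record (empty if none)."""
    problems = []
    # At least one of title, lead, body must carry some text.
    if not any(str(article.get(field) or "").strip() for field in TEXT_FIELDS):
        problems.append("at least one of title, lead, body is required")
    # The date is optional, but if present it should parse as ISO 8601.
    date = article.get("date")
    if date is not None:
        try:
            datetime.fromisoformat(date)
        except (TypeError, ValueError):
            problems.append(f"date is not ISO 8601: {date!r}")
    # In jsonl/json, sections is an array of section names.
    sections = article.get("sections")
    if sections is not None and not isinstance(sections, list):
        problems.append("sections should be an array of section names")
    return problems


def sections_to_cell(sections):
    """Join section names with | for a single CSV or Excel cell."""
    return "|".join(sections)


if __name__ == "__main__":
    print(check_article({"id": "a3", "date": "24/08/2022"}))
    print(sections_to_cell(["politics", "business"]))
```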

Example

This is an example of the content of a jsonl file containing two articles:

{"id": "a1", "date": "2022-08-24", "title": "A new continent discovered", "lead": "Scientists discovered a ....", "body": "For millions of years, ...", "sections": ["news"], "url": "https://news-today.com/news/article123"}
{"id": "a2", "date": "2022-08-25", "title": "A new continent disappeared", "lead": "Scientists are puzzled.", "body": "Yesterday, the continent ...", "sections": ["news"], "url": "https://news-today.com/news/article222"}
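
A file like this can be produced and compressed in one step. The following is a minimal sketch in Python, assuming the articles are held in memory as a list of dictionaries; the function names and the file name articles.jsonl.gz are illustrative only.

```python
import gzip
import json


def write_jsonl_gz(articles, path):
    """Write one JSON object per line, UTF-8 encoded and gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as f:
        for article in articles:
            # json.dumps escapes line breaks inside strings, so each
            # article stays on a single line.
            f.write(json.dumps(article, ensure_ascii=False) + "\n")


def read_jsonl_gz(path):
    """Read the file back, e.g. to verify the export before sharing it."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    articles = [
        {"id": "a1", "date": "2022-08-24", "title": "A new continent discovered",
         "sections": ["news"]},
    ]
    write_jsonl_gz(articles, "articles.jsonl.gz")
    print(len(read_jsonl_gz("articles.jsonl.gz")))
```

For archives too large to hold in memory, the same write loop can be fed from a generator that yields one article at a time.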