Input Preprocessing

The following preprocessing steps can be performed before analysis. They are controlled via request parameters:

  • Text extraction – Extracts plain text from various data formats.
  • Content extraction – Extracts the main content from an HTML page. Use the htmlExtractor parameter in the Request with one of the following values:
    • default
    • article
    • keep-everything
  • Sentence segmentation – In media domains, any line break (newline character) marks a sentence boundary. In all other domains, a sentence boundary requires at least two line breaks, which may be separated by other whitespace.
  • Spelling correction – Fixes common spelling mistakes. This correction is applied automatically when the textType parameter in the Request is set to casual.
  • Adding diacritics – Adds diacritical marks to text written without them (currently supported for Czech only). Use the diacritization parameter in the Request with one of the following values:
    • none – Do not modify diacritics.
    • auto – Add diacritics only if needed.
    • yes – Always add diacritics.
    • redo – First remove, then re-add diacritics.
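
The options above can be sketched in a few lines of Python. This is an illustrative sketch only, not the service's implementation: the helper names build_preprocessing_params and split_on_line_breaks are hypothetical, the parameter names and allowed values are taken from the list above, and the splitter models only the line-break rule (it does no punctuation-based sentence splitting). How the resulting parameters are sent to the service is not shown, because the Request format is not described in this section.

```python
import re

# Allowed values as listed above. The helper functions are hypothetical.
HTML_EXTRACTORS = {"default", "article", "keep-everything"}
DIACRITIZATION_MODES = {"none", "auto", "yes", "redo"}


def build_preprocessing_params(html_extractor=None, text_type=None,
                               diacritization=None):
    """Collect preprocessing parameters into a dict, rejecting unlisted values."""
    params = {}
    if html_extractor is not None:
        if html_extractor not in HTML_EXTRACTORS:
            raise ValueError(f"unknown htmlExtractor value: {html_extractor!r}")
        params["htmlExtractor"] = html_extractor
    if text_type is not None:
        # Only "casual" is mentioned above, so other values are passed through
        # without validation.
        params["textType"] = text_type
    if diacritization is not None:
        if diacritization not in DIACRITIZATION_MODES:
            raise ValueError(f"unknown diacritization value: {diacritization!r}")
        params["diacritization"] = diacritization
    return params


def split_on_line_breaks(text, media_domain):
    """Approximate the line-break rule for sentence boundaries.

    Media domains: every line break is a boundary.
    Other domains: a boundary needs at least two line breaks, which may be
    separated by other whitespace.
    """
    pattern = r"\n" if media_domain else r"\n\s*\n"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]
```

For example, the text "Headline\nLine one\nline two\n \nNext paragraph" yields four segments under the media rule, but only two under the rule for other domains, because single line breaks are then treated as ordinary whitespace inside a segment.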