Input preprocessing

The following preprocessing steps can be performed before analysis. The parameters named below are set on the Request.

  • Text extraction - extract plain text from various data formats.
  • Content extraction - extract the main content from an HTML page. Use the Request's htmlExtractor parameter with these values:
    • default
    • article
    • keep-everything
  • Sentence segmentation - in the media domains, any line break (newline character) separates sentences. In the other domains, at least two line breaks, possibly separated by other whitespace, are required to separate sentences.
  • Spelling correction - fixes some common spelling errors. Correction runs automatically when the Request's textType parameter is set to casual.
  • Adding diacritics - adds diacritical marks to text written without them. Currently only Czech is supported. Use the Request's diacritization parameter with these values:
    • none - do nothing
    • auto - add diacritics if necessary
    • yes - add diacritics
    • redo - remove and then add diacritics
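
The sentence segmentation rule and the parameter values listed above can be illustrated with a short Python sketch. It is not the service's implementation: the function names, the media_domain flag, and the idea of passing parameters as a plain dictionary are assumptions made for illustration. Only the parameter names and their values come from the list above. The sketch splits text at the line breaks that are guaranteed to act as sentence boundaries; any further segmentation inside each block is left to the analyzer.

```python
import re

# Parameter values taken from the list above. How they are sent to the
# service (JSON body, query string, ...) is not specified here.
ALLOWED_VALUES = {
    "htmlExtractor": {"default", "article", "keep-everything"},
    "diacritization": {"none", "auto", "yes", "redo"},
}


def check_params(params: dict) -> list:
    """Return the names of parameters whose value is not one of the listed ones."""
    return [
        name
        for name, value in params.items()
        if name in ALLOWED_VALUES and value not in ALLOWED_VALUES[name]
    ]


def split_at_forced_boundaries(text: str, media_domain: bool) -> list:
    """Split text where a line break is guaranteed to separate sentences.

    media_domain is a hypothetical flag; how a domain is identified is an
    assumption of this sketch.
    """
    if media_domain:
        # Media domains: any single line break is a boundary.
        pattern = r"\n"
    else:
        # Other domains: two or more line breaks, possibly with other
        # whitespace between them.
        pattern = r"\n\s*\n"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


sample = "First line\nSecond line\n \nThird line"
media_blocks = split_at_forced_boundaries(sample, media_domain=True)
other_blocks = split_at_forced_boundaries(sample, media_domain=False)
bad_params = check_params({"htmlExtractor": "article", "diacritization": "maybe"})
```

With this sample, the media setting yields three blocks, while the non-media setting keeps the first two lines together because only a single line break separates them.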