Input preprocessing

The following preprocessing steps can be performed before analysis. Where a step is configurable, the relevant Request parameter is noted.

  • Text extraction - extract plain text from various data formats.

  • Content extraction - extract the main content from an HTML page. Set the Request’s htmlExtractor parameter to one of the following values:

    • default

    • article

    • keep-everything

  • Sentence segmentation - in the media domains, any line break (newline character) separates sentences. In all other domains, a sentence boundary requires at least two line breaks, possibly separated by other whitespace.

  • Spelling correction - fixes some common spelling errors. Correction runs automatically when the Request’s textType field is set to casual.

  • Adding diacritics - adds diacritical marks to text written without them. Currently only Czech is supported. Set the Request’s diacritization field to one of the following values:

    • none - do nothing

    • auto - add diacritics if necessary

    • yes - add diacritics

    • redo - remove and then add diacritics
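
The options above can be illustrated with a minimal Python sketch. It is not the service's implementation: the helper names (build_request, split_on_linebreaks) and the "text" key are assumptions made for illustration, and only the field names textType, htmlExtractor and diacritization and their listed values come from this section. The segmentation helper models only the line-break rule described above; the actual segmenter presumably also splits on punctuation.

```python
import json
import re

# Values listed in this section.
HTML_EXTRACTORS = {"default", "article", "keep-everything"}
DIACRITIZATION_MODES = {"none", "auto", "yes", "redo"}


def build_request(text, text_type=None, html_extractor=None, diacritization=None):
    """Build a request body as a dict (hypothetical helper).

    The "text" key is an assumption; the other keys are the Request
    fields named in this section. Unset options are simply omitted.
    """
    if html_extractor is not None and html_extractor not in HTML_EXTRACTORS:
        raise ValueError(f"unknown htmlExtractor value: {html_extractor!r}")
    if diacritization is not None and diacritization not in DIACRITIZATION_MODES:
        raise ValueError(f"unknown diacritization value: {diacritization!r}")

    request = {"text": text}
    if text_type is not None:
        request["textType"] = text_type  # "casual" triggers spelling correction
    if html_extractor is not None:
        request["htmlExtractor"] = html_extractor
    if diacritization is not None:
        request["diacritization"] = diacritization
    return request


def split_on_linebreaks(text, media_domain):
    """Split text at the line breaks that force a sentence boundary.

    Media domains: any single line break separates sentences.
    Other domains: at least two line breaks, possibly separated by
    other whitespace, are needed.
    """
    pattern = r"\n" if media_domain else r"\n\s*\n"
    chunks = (chunk.strip() for chunk in re.split(pattern, text))
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    body = build_request("Prilis zlutoucky kun", text_type="casual", diacritization="auto")
    print(json.dumps(body))

    sample = "Headline\nFirst paragraph.\n \nSecond paragraph."
    print(split_on_linebreaks(sample, media_domain=True))
    print(split_on_linebreaks(sample, media_domain=False))
```

With this sample, the media rule separates the headline from the first paragraph, while the non-media rule keeps them together because only one line break lies between them.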