The following preprocessing steps can be performed before analysis. The parameters
Text extraction - extract plain text from various data formats.
Content extraction - extract the main content from an HTML page. Use the parameter Request’s
Sentence segmentation - in the media domains, any line break (newline character) separates sentences. In other domains, at least two line breaks, possibly separated by other whitespace, are needed.
Spelling correction - fixes some common spelling errors. Correction is run automatically when the Request’s textType field is set to
Adding diacritics - adds diacritical marks to a text written without them. Currently only Czech is supported. Use the field Request’s
none - do nothing
auto - add diacritics if necessary
yes - add diacritics
redo - remove and then add diacritics
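
The line-break rule described under sentence segmentation can be illustrated with a small sketch. It is not the service's implementation: the function name `split_on_linebreaks` and the `media_domain` flag are hypothetical, and the sketch models only the line-break boundaries, not any other sentence-boundary detection the service may perform.

```python
import re


def split_on_linebreaks(text: str, media_domain: bool) -> list[str]:
    """Split text at the line-break boundaries described above.

    Hypothetical helper for illustration only; it is not part of the API.
    """
    if media_domain:
        # Media domains: every single newline is a boundary.
        pattern = r"\n"
    else:
        # Other domains: at least two newlines, possibly separated
        # by other whitespace (spaces, tabs, carriage returns).
        pattern = r"\n(?:\s*\n)+"
    parts = re.split(pattern, text)
    # Drop empty fragments and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


sample = "Title\nFirst line\n \nSecond paragraph"
media_segments = split_on_linebreaks(sample, media_domain=True)
other_segments = split_on_linebreaks(sample, media_domain=False)
```

With this sample, the media-domain setting yields three segments, while the other setting keeps "Title" and "First line" together because only one line break separates them.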