Input preprocessing
The following preprocessing can be performed before analysis. The parameters
Text extraction - extract plain text from various data formats.
Content extraction - extract the main content from an HTML page. Use the Request’s htmlExtractor parameter with one of the values default, article, or keep-everything.
Sentence segmentation - in the media domains, any linebreak (newline character) separates sentences. In the other domains, at least two linebreaks, possibly separated by other whitespace, are needed.
Spelling correction - fixes some common spelling errors. Correction runs automatically when the Request’s textType field is set to casual.
Adding diacritics - adds diacritical marks to a text written without them. Currently only Czech is supported. Use the Request’s diacritization field with one of the following values:
  none - do nothing
  auto - add diacritics if necessary
  yes - add diacritics
  redo - remove and then add diacritics
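
The options above can be illustrated with a short Python sketch. It is not the service's implementation. The first function only mimics the linebreak rule described under sentence segmentation (a real segmenter presumably also splits sentences within a line). The second assembles a request body from the fields named above: htmlExtractor, textType and diacritization come from this page, while the `text` key, the function names and the validation are assumptions made for illustration.

```python
import json
import re
from typing import Dict, List, Optional

# Values listed on this page.
HTML_EXTRACTORS = {"default", "article", "keep-everything"}
DIACRITIZATION = {"none", "auto", "yes", "redo"}

# Media domains: any single linebreak is a boundary.
_LINE_BREAK = re.compile(r"\n")
# Other domains: at least two linebreaks, possibly separated by whitespace.
_BLANK_LINE_BREAK = re.compile(r"\n\s*\n")


def split_on_linebreaks(text: str, media_domain: bool) -> List[str]:
    """Split text at the linebreak boundaries described above (sketch only)."""
    pattern = _LINE_BREAK if media_domain else _BLANK_LINE_BREAK
    return [part.strip() for part in pattern.split(text) if part.strip()]


def build_request(
    text: str,
    html_extractor: Optional[str] = None,
    text_type: Optional[str] = None,
    diacritization: Optional[str] = None,
) -> Dict[str, str]:
    """Build a request body; the "text" key is an assumed field name."""
    if html_extractor is not None and html_extractor not in HTML_EXTRACTORS:
        raise ValueError(f"unknown htmlExtractor value: {html_extractor!r}")
    if diacritization is not None and diacritization not in DIACRITIZATION:
        raise ValueError(f"unknown diacritization value: {diacritization!r}")

    body = {"text": text}
    # Only include the options the caller actually set.
    if html_extractor is not None:
        body["htmlExtractor"] = html_extractor
    if text_type is not None:
        body["textType"] = text_type
    if diacritization is not None:
        body["diacritization"] = diacritization
    return body


if __name__ == "__main__":
    # Casual Czech text written without diacritics.
    request = build_request("Prilis zlutoucky kun", text_type="casual", diacritization="auto")
    print(json.dumps(request, ensure_ascii=False))
```

Leaving unset options out of the body, rather than sending guessed defaults, avoids asserting default values that this page does not state.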