Input preprocessing
The following preprocessing can be performed before analysis. The parameters
- Text extraction - extract plain text from various data formats.
- Content extraction - extract main content from an HTML page.
  Use the Request's htmlExtractor parameter with one of these values: default, article, keep-everything.
- Sentence segmentation - in the media domains, any linebreak (newline character) separates sentences. In the other domains, at least two linebreaks, possibly separated by other whitespace, are needed.
- Spelling correction - fixes some common spelling errors.
  Correction runs automatically when the Request's textType parameter is set to casual.
- Adding diacritics - adds diacritical marks to a text written without them.
  Currently only Czech is supported. Use the Request's diacritization parameter with these values:
  - none - do nothing
  - auto - add diacritics if necessary
  - yes - add diacritics
  - redo - remove and then add diacritics
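The options above can be sketched in Python. This is an illustration only, not the service's client code. The helper names build_preprocessing_params and split_on_linebreaks are hypothetical, and it is an assumption that the parameters are sent as fields of a JSON-like request object. Only the parameter names and values listed above come from this page. The second helper imitates just the linebreak rule of sentence segmentation; the real segmenter presumably applies further rules that are not described here.

```python
import re

# Values listed in this section; anything else is rejected by the sketch.
HTML_EXTRACTORS = {"default", "article", "keep-everything"}
DIACRITIZATION = {"none", "auto", "yes", "redo"}


def build_preprocessing_params(html_extractor=None, text_type=None,
                               diacritization=None):
    """Collect preprocessing options into a dict (hypothetical helper).

    Assumption: the Request is a JSON-like object whose fields carry
    the parameter names used in this section.
    """
    params = {}
    if html_extractor is not None:
        if html_extractor not in HTML_EXTRACTORS:
            raise ValueError(f"unknown htmlExtractor: {html_extractor!r}")
        params["htmlExtractor"] = html_extractor
    if text_type is not None:
        # Only "casual" is mentioned here (it turns on spelling correction);
        # other values are passed through unchecked.
        params["textType"] = text_type
    if diacritization is not None:
        if diacritization not in DIACRITIZATION:
            raise ValueError(f"unknown diacritization: {diacritization!r}")
        params["diacritization"] = diacritization
    return params


def split_on_linebreaks(text, media_domain):
    """Split text where the linebreak rule puts a sentence boundary.

    Media domains: any linebreak is a boundary.
    Other domains: at least two linebreaks, possibly separated by
    other whitespace, are needed.
    """
    pattern = r"\n+" if media_domain else r"\n\s*\n"
    parts = (part.strip() for part in re.split(pattern, text))
    return [part for part in parts if part]


if __name__ == "__main__":
    print(build_preprocessing_params("article", "casual", "auto"))
    print(split_on_linebreaks("Title\nFirst line\n \nSecond block", False))
```

Validating the values on the client side is a design choice of this sketch; it catches a misspelled option before a request is sent.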