Input Preprocessing
The following preprocessing steps can be performed before analysis. They are controlled via request parameters:
- Text extraction – Extracts plain text from various data formats.
- Content extraction – Extracts the main content from an HTML page.
Use the htmlExtractor parameter in the Request with one of the following values: default, article, keep-everything.
- Sentence segmentation – In media domains, any single line break (newline character) separates sentences. In all other domains, at least two line breaks, possibly separated by other whitespace, are required to indicate a new sentence.
- Spelling correction – Fixes common spelling mistakes.
This correction is applied automatically when the textType parameter in the Request is set to casual.
- Adding diacritics – Adds diacritical marks to text written without them (currently supported for Czech only).
Use the diacritization parameter in the Request with one of the following values:
  - none – Do not modify diacritics.
  - auto – Add diacritics only if needed.
  - yes – Always add diacritics.
  - redo – First remove, then re-add diacritics.
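
The parameters and the line-break rule above can be sketched in Python. This is only an illustration, not the service's implementation. The parameter names and allowed values are taken from the list above. Everything else is an assumption: the helper names build_preprocessing_params and split_on_line_breaks are hypothetical, and the parameters are assumed to be sent as top-level fields of the Request. The splitting function covers only the line-break rule; any punctuation-based segmentation the service performs is out of scope.

```python
import re

# Allowed values as listed in the text above.
HTML_EXTRACTOR_VALUES = {"default", "article", "keep-everything"}
DIACRITIZATION_VALUES = {"none", "auto", "yes", "redo"}


def build_preprocessing_params(html_extractor=None, text_type=None, diacritization=None):
    """Collect preprocessing parameters into a dict (hypothetical helper).

    Assumption: the parameters are top-level fields of the Request.
    """
    params = {}
    if html_extractor is not None:
        if html_extractor not in HTML_EXTRACTOR_VALUES:
            raise ValueError(f"unsupported htmlExtractor value: {html_extractor!r}")
        params["htmlExtractor"] = html_extractor
    if text_type is not None:
        # Only "casual" is mentioned above; other values are passed through unchecked.
        params["textType"] = text_type
    if diacritization is not None:
        if diacritization not in DIACRITIZATION_VALUES:
            raise ValueError(f"unsupported diacritization value: {diacritization!r}")
        params["diacritization"] = diacritization
    return params


def split_on_line_breaks(text, media_domain):
    """Split text at the line-break boundaries described above.

    Media domains: any single newline is a boundary.
    Other domains: at least two newlines, possibly separated by other whitespace.
    """
    pattern = r"\n" if media_domain else r"\n\s*\n"
    # Drop empty pieces so runs of blank lines do not yield empty segments.
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


if __name__ == "__main__":
    print(build_preprocessing_params(html_extractor="article", diacritization="auto"))
    print(split_on_line_breaks("Headline\nBody text", media_domain=True))
    print(split_on_line_breaks("Headline\nBody text", media_domain=False))
```

With the input "Headline\nBody text", the media-domain call returns two segments, while the non-media call returns the text as a single segment because it contains only one line break.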