Corpus Cleaner
#include "normalizer.hpp"
#include "util.hpp"
#include <unicode/datefmt.h>
#include <unicode/dtfmtsym.h>
#include <unicode/gregocal.h>
#include <unicode/timezone.h>
#include <unicode/unistr.h>
#include <unicode/ustring.h>
#include <unicode/dtptngen.h>
#include <unicode/dtitvfmt.h>
#include <unicode/normalizer2.h>
Functions
| Return type | Function | Description |
|---|---|---|
| wstring | UnicodeNormalize (wregex word_pattern, wstring sentence_w) | NFKC-normalizes a sentence with icu::Normalizer2. |
| wstring | TranslateToFullwidth (const wstring &sentence_w) | Replaces specific characters, converting them from half-width to full-width. |
| wstring | RemoveExtraSpaces (const wstring &sentence) | Removes half-width spaces that meet certain conditions. |
| string | NormalizeNeologd (string sentence) | NEologd normalization function. |
string NormalizeNeologd (string sentence)
NEologd normalization function.
Performs the normalization process described at the link below.
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
Example:
| string | sentence: text sentence |
Definition at line 174 of file normalizer.cpp.
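To give a feel for the kind of processing the linked page describes, here is a minimal sketch of three of its steps: trimming, unifying hyphen variants, and removing tilde variants. `SketchNeologdSteps` is a hypothetical name, the character sets are taken from the linked wiki as best understood, and the sketch works on `wstring` rather than `string`, so it should not be read as the actual body of NormalizeNeologd.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not the library's NormalizeNeologd: it covers only
// three steps of the linked procedure and works on wstring for brevity.
std::wstring SketchNeologdSteps(const std::wstring& sentence) {
    // 1. Trim leading and trailing ASCII whitespace.
    const std::wstring kWs = L" \t\r\n";
    const auto first = sentence.find_first_not_of(kWs);
    if (first == std::wstring::npos) return L"";
    const auto last = sentence.find_last_not_of(kWs);
    std::wstring s = sentence.substr(first, last - first + 1);

    // 2. Collapse each run of hyphen-like characters into one ASCII hyphen.
    static const std::wregex kHyphens(
        L"[\u02D7\u058A\u2010\u2011\u2012\u2013\u2043\u207B\u208B\u2212]+");
    s = std::regex_replace(s, kHyphens, L"-");

    // 3. Remove tilde-like characters (ASCII tilde, wave dash, and variants).
    static const std::wregex kTildes(L"[~\u223C\u223E\u301C\u3030\uFF5E]");
    s = std::regex_replace(s, kTildes, L"");
    return s;
}
```

The remaining steps on the linked page (long-vowel mark unification, width conversion, space removal) would be chained after these in the same way.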
wstring RemoveExtraSpaces (const wstring & sentence)
Removes half-width spaces that meet certain conditions.
Replaces each run of one or more half-width spaces with a single half-width space.
It also removes half-width spaces that meet the following conditions.
Example:
| const wstring& | sentence: text sentence |
Definition at line 135 of file normalizer.cpp.
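The first rule above, collapsing runs of half-width spaces, can be sketched with `std::regex_replace`. `CollapseSpaces` is a hypothetical helper written for illustration; it does not implement the conditional removal that RemoveExtraSpaces also performs.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not the library's RemoveExtraSpaces: it only
// collapses each run of U+0020 into a single space.
std::wstring CollapseSpaces(const std::wstring& sentence) {
    static const std::wregex kSpaces(L" +");
    return std::regex_replace(sentence, kSpaces, L" ");
}
```

Under this sketch, `CollapseSpaces(L"a   b  c")` yields `L"a b c"`, while tabs and full-width spaces are left untouched.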
wstring TranslateToFullwidth (const wstring & sentence_w)
Replaces specific characters, converting them from half-width to full-width.
Replaces the following half-width symbols with their full-width equivalents:
/!”#$%&’()*+,−./:;<>?@[¥]^_`{|}
Example:
| const wstring& | sentence_w: text sentence |
Definition at line 91 of file normalizer.cpp.
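One way to picture a half-width to full-width conversion is the fixed offset between ASCII and the Unicode Fullwidth Forms block: U+0021..U+007E map to U+FF01..U+FF5E by adding 0xFEE0. The sketch below assumes that offset and a hypothetical target set `kTargets`; the library's actual mapping may treat some symbols (quotes, the yen sign, the minus sign) differently.

```cpp
#include <string>

// Hypothetical sketch, not the library's TranslateToFullwidth: shifts a
// chosen set of ASCII symbols into the Fullwidth Forms block
// (U+FF01..U+FF5E equals U+0021..U+007E plus 0xFEE0).
std::wstring ToFullwidthSymbols(const std::wstring& sentence_w) {
    // Assumed target set; letters and digits are deliberately excluded.
    static const std::wstring kTargets = L"!#$%&()*+,./:;<>?@[]^_`{|}";
    std::wstring out;
    out.reserve(sentence_w.size());
    for (wchar_t c : sentence_w) {
        if (kTargets.find(c) != std::wstring::npos) {
            out.push_back(static_cast<wchar_t>(c + 0xFEE0));
        } else {
            out.push_back(c);  // everything else passes through unchanged
        }
    }
    return out;
}
```

With this sketch, `L"a!b?"` becomes `L"a\uFF01b\uFF1F"`: the letters are kept and only the two symbols are widened.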
wstring UnicodeNormalize (wregex word_pattern, wstring sentence_w)
NFKC-normalizes a sentence with icu::Normalizer2.
Searches the sentence for words that match the word_pattern regular expression and applies NFKC normalization to them using icu::Normalizer2.
Example:
| wregex | word_pattern: regular expression matching the strings to be normalized |
| wstring | sentence_w: the sentence to normalize |
Definition at line 38 of file normalizer.cpp.
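The "normalize only what matches" pattern can be sketched without ICU by passing the normalization step in as a callback. `NormalizeMatches` and `FoldFullwidthAscii` are hypothetical names; the latter is a stand-in that covers only full-width ASCII variants, whereas the documented function delegates to icu::Normalizer2 for full NFKC.

```cpp
#include <functional>
#include <regex>
#include <string>

// Hypothetical helper: rewrites only the spans of sentence_w that match
// word_pattern, passing each match through `normalize`.
std::wstring NormalizeMatches(
    const std::wregex& word_pattern, const std::wstring& sentence_w,
    const std::function<std::wstring(const std::wstring&)>& normalize) {
    std::wstring out;
    auto last = sentence_w.cbegin();
    const std::wsregex_iterator end;
    for (std::wsregex_iterator it(sentence_w.cbegin(), sentence_w.cend(),
                                  word_pattern);
         it != end; ++it) {
        const std::wssub_match& m = (*it)[0];
        out.append(last, m.first);   // text before the match, unchanged
        out += normalize(m.str());   // the match itself, normalized
        last = m.second;
    }
    out.append(last, sentence_w.cend());  // trailing text, unchanged
    return out;
}

// Stand-in for NFKC (assumption): folds U+FF01..U+FF5E to ASCII only.
std::wstring FoldFullwidthAscii(const std::wstring& s) {
    std::wstring out;
    for (wchar_t c : s) {
        const bool fullwidth = (c >= 0xFF01 && c <= 0xFF5E);
        out.push_back(fullwidth ? static_cast<wchar_t>(c - 0xFEE0) : c);
    }
    return out;
}
```

Keeping the pattern separate from the normalizer lets the caller decide which character classes are folded, which is presumably why the documented function takes word_pattern as a parameter.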