Corpus Cleaner
normalizer.hpp File Reference
#include <bits/stdc++.h>


Functions

wstring UnicodeNormalize (wregex word_pattern, wstring sentence_w)
 NFKC-normalize parts of a sentence using icu::Normalizer2.
 
wstring TranslateToFullwidth (const wstring &sentence_w)
 Replace specific half-width symbols with their full-width equivalents.
 
wstring RemoveExtraSpaces (const wstring &sentence)
 Remove half-width spaces that meet specific conditions.
 
string NormalizeNeologd (string sentence)
 Normalize a sentence following the NEologd rules.
 
int Normalizer (string input_path, string output_path)
 

Function Documentation

◆ NormalizeNeologd()

string NormalizeNeologd ( string sentence)

Normalize a sentence following the NEologd rules.

Performs the normalization process described at the link below.
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja

Example:

string sentence= "検索 エンジン 自作 入門 を 買い ました!!!";
sentence = NormalizeNeologd(sentence); //"検索エンジン自作入門を買いました"
Parameters
string sentence: the text to normalize
Returns
string: the normalized sentence
Attention
This process is intended for Japanese text. Do not apply it to a corpus that contains English text or code.
For example, in English text the spaces between words are removed.

Definition at line 174 of file normalizer.cpp.

◆ Normalizer()

int Normalizer ( string input_path,
string output_path )

◆ RemoveExtraSpaces()

wstring RemoveExtraSpaces ( const wstring & sentence)

Remove half-width spaces that meet specific conditions.

Runs of one or more half-width spaces are first collapsed into a single half-width space.
A half-width space is then removed when it falls into either of the following cases.

  • It lies between two characters that are each hiragana, full-width katakana, half-width katakana, kanji, or full-width symbols.
  • It lies between one such character (hiragana, full-width katakana, half-width katakana, kanji, or full-width symbol) and a half-width alphanumeric character.

Example:

wstring sentence = L"検索 エンジン 自作";
sentence = RemoveExtraSpaces(sentence); // L"検索エンジン自作"
Parameters
const wstring& sentence: the text to process
Returns
wstring: the sentence with extra spaces removed

Definition at line 135 of file normalizer.cpp.
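
The two rules above can be sketched without regular expressions. This is a minimal illustration, not the code in normalizer.cpp: the names IsJapaneseLike, IsHalfwidthAlnum and RemoveExtraSpacesSketch are hypothetical, and the Unicode ranges chosen for each character class are assumptions that may differ from the ranges the library uses.

```cpp
#include <cstddef>
#include <string>

// Assumed character classes (hypothetical ranges, chosen for illustration).
bool IsJapaneseLike(wchar_t c) {
    return (c >= 0x3040 && c <= 0x309F)     // hiragana
        || (c >= 0x30A0 && c <= 0x30FF)     // full-width katakana
        || (c >= 0xFF66 && c <= 0xFF9F)     // half-width katakana
        || (c >= 0x4E00 && c <= 0x9FFF)     // kanji (CJK Unified Ideographs)
        || (c >= 0x3000 && c <= 0x303F)     // CJK symbols and punctuation
        || (c >= 0xFF01 && c <= 0xFF5E);    // full-width ASCII variants
}

bool IsHalfwidthAlnum(wchar_t c) {
    return (c >= L'0' && c <= L'9') || (c >= L'A' && c <= L'Z') ||
           (c >= L'a' && c <= L'z');
}

std::wstring RemoveExtraSpacesSketch(const std::wstring& sentence) {
    // Pass 1: collapse each run of half-width spaces into one space.
    std::wstring collapsed;
    for (wchar_t c : sentence) {
        if (c == L' ' && !collapsed.empty() && collapsed.back() == L' ') continue;
        collapsed.push_back(c);
    }
    // Pass 2: drop a space when its neighbours match one of the two rules.
    std::wstring out;
    for (std::size_t i = 0; i < collapsed.size(); ++i) {
        const wchar_t c = collapsed[i];
        if (c == L' ' && i > 0 && i + 1 < collapsed.size()) {
            const wchar_t prev = collapsed[i - 1];
            const wchar_t next = collapsed[i + 1];
            const bool between_japanese = IsJapaneseLike(prev) && IsJapaneseLike(next);
            const bool japanese_and_alnum =
                (IsJapaneseLike(prev) && IsHalfwidthAlnum(next)) ||
                (IsHalfwidthAlnum(prev) && IsJapaneseLike(next));
            if (between_japanese || japanese_and_alnum) continue;
        }
        out.push_back(c);
    }
    return out;
}
```

With these assumptions, a space between two hiragana characters is removed, while a single space between two half-width letters is kept.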

◆ TranslateToFullwidth()

wstring TranslateToFullwidth ( const wstring & sentence_w)

Replace specific half-width symbols with their full-width equivalents.

The following symbols are replaced with their full-width forms:
/!”#$%&’()*+,−./:;<>?@[¥]^_`{|}

Example:

wstring sentence = L"()";
sentence = TranslateToFullwidth(sentence); // L"（）"
Parameters
const wstring& sentence_w: the text to process
Returns
wstring: the sentence with the listed symbols converted

Definition at line 91 of file normalizer.cpp.
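
As an illustration of a half-width to full-width conversion, the sketch below relies on the fact that the full-width ASCII variants U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII U+0021 to U+007E. The function name ToFullwidthSymbols and its symbol set are hypothetical. The sketch leaves out the quote, minus and yen entries, whose replacements may not follow this fixed offset, and it is not the implementation in normalizer.cpp.

```cpp
#include <string>

// Hypothetical helper: converts a fixed set of ASCII symbols to their
// full-width forms and leaves every other character untouched.
std::wstring ToFullwidthSymbols(const std::wstring& sentence) {
    // Assumed symbol set for this sketch; '=' is deliberately not included.
    static const std::wstring kSymbols = L"!#$%&()*+,./:;<>?@[]^_`{|}";
    std::wstring out;
    out.reserve(sentence.size());
    for (wchar_t c : sentence) {
        if (kSymbols.find(c) != std::wstring::npos) {
            // U+0021..U+007E map to U+FF01..U+FF5E by adding 0xFEE0.
            out.push_back(static_cast<wchar_t>(c + 0xFEE0));
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

Under these assumptions, ToFullwidthSymbols(L"(a)") yields L"（a）": the parentheses become full-width and the letter is unchanged.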

◆ UnicodeNormalize()

wstring UnicodeNormalize ( wregex word_pattern,
wstring sentence_w )

NFKC-normalize parts of a sentence using icu::Normalizer2.

Searches the sentence for substrings that match the word_pattern regular expression and applies NFKC normalization to each match using icu::Normalizer2.

Example:

wstring sentence = L"０１２３４５６７８９";
static wregex word_pattern(L"(([０-９]+))");
wstring normalized_sentence = UnicodeNormalize(word_pattern, sentence);
// normalized_sentence == L"0123456789"
Parameters
wregex word_pattern: regular expression matching the substrings to normalize
wstring sentence_w: the sentence to process
Returns
wstring: the normalized sentence. See https://ja.wikipedia.org/wiki/Unicode%E4%B8%80%E8%A6%A7_0000-0FFF

Definition at line 38 of file normalizer.cpp.
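
The "normalize only what the pattern matches" idea can be sketched with the standard library alone. To stay self-contained, the example below does not call ICU. It takes the normalization step as a callback and uses a hypothetical stand-in, FoldFullwidthAscii, which only folds full-width ASCII variants to ASCII (one of the effects NFKC has on those characters). The names NormalizeMatches and FoldFullwidthAscii are assumptions for illustration, not part of normalizer.hpp.

```cpp
#include <functional>
#include <regex>
#include <string>

// Hypothetical stand-in for NFKC: folds U+FF01..U+FF5E to U+0021..U+007E.
std::wstring FoldFullwidthAscii(const std::wstring& s) {
    std::wstring out;
    out.reserve(s.size());
    for (wchar_t c : s) {
        if (c >= 0xFF01 && c <= 0xFF5E) {
            out.push_back(static_cast<wchar_t>(c - 0xFEE0));
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Applies `normalize` to every match of `pattern`, copying the rest verbatim.
std::wstring NormalizeMatches(
    const std::wregex& pattern, const std::wstring& sentence,
    const std::function<std::wstring(const std::wstring&)>& normalize) {
    std::wstring out;
    auto last = sentence.begin();
    const std::wsregex_iterator end;
    for (std::wsregex_iterator it(sentence.begin(), sentence.end(), pattern);
         it != end; ++it) {
        out.append(last, (*it)[0].first);  // text before the match
        out += normalize(it->str());       // normalized match
        last = (*it)[0].second;
    }
    out.append(last, sentence.end());      // trailing unmatched text
    return out;
}
```

Calling NormalizeMatches with a pattern such as L"[\uFF10-\uFF19]+" (full-width digits) converts only the digit runs and leaves other full-width characters in the sentence untouched. In a real implementation the callback would be replaced by a call into icu::Normalizer2.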