#include <bits/stdc++.h>

Functions
wstring	UnicodeNormalize (wregex word_pattern, wstring sentence_w)
	NFKC-normalize a sentence using icu::Normalizer2.

wstring	TranslateToFullwidth (const wstring &sentence_w)
	Replace a specific string from half-width to full-width.

wstring	RemoveExtraSpaces (const wstring &sentence)
	Remove half-width spaces that meet certain conditions.

string	NormalizeNeologd (string sentence)
	Neologd normalization function.

int	Normalizer (string input_path, string output_path)

Function Documentation

◆ NormalizeNeologd()

string NormalizeNeologd ( string sentence )

Neologd normalization function.

Performs the normalization process described at the link below.
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja

Example:

string sentence= "検索エンジン自作入門を買いました!!!";

sentence = NormalizeNeologd(sentence); //"検索エンジン自作入門を買いました"

Parameters

string sentence: input sentence

Returns: string: the processed sentence

Attention: This process is intended for Japanese text. Do not apply it to English text or source code in your corpus.
For example, in English text the spaces between words will be removed.

Definition at line 174 of file normalizer.cpp.

◆ Normalizer()

int Normalizer ( string input_path, string output_path )

◆ RemoveExtraSpaces()

wstring RemoveExtraSpaces ( const wstring & sentence )

Remove half-width spaces that meet certain conditions.

Replace each run of one or more half-width spaces with a single half-width space,
and remove any half-width space that meets either of the following conditions.

A half-width space between two characters that are each hiragana, full-width katakana, half-width katakana, kanji, or a full-width symbol
A half-width space between one such character (hiragana, full-width katakana, half-width katakana, kanji, or full-width symbol) and a half-width alphanumeric character

Example:

wstring sentence = L"検索 エンジン 自作 入門";

sentence = RemoveExtraSpaces(sentence); //L"検索エンジン自作入門"

Parameters

const wstring& sentence: input sentence

Returns: wstring: the processed sentence

Definition at line 135 of file normalizer.cpp.
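
The two rules above can be illustrated with a minimal sketch. This is not the code in normalizer.cpp: the function name RemoveExtraSpacesSketch, the helpers IsJapaneseChar and IsHalfwidthAlnum, and the Unicode ranges they use are all assumptions made for illustration, and the sketch treats the second rule as applying in either order.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: hiragana, katakana (full- and half-width), kanji,
// and full-width symbols. The code point ranges are an assumption.
static bool IsJapaneseChar(wchar_t c) {
    return (c >= 0x3001 && c <= 0x30FF)   // CJK symbols, hiragana, katakana
        || (c >= 0x4E00 && c <= 0x9FFF)   // CJK unified ideographs (kanji)
        || (c >= 0xFF01 && c <= 0xFFEF);  // full-width forms, half-width katakana
}

// Hypothetical helper: half-width (ASCII) letters and digits.
static bool IsHalfwidthAlnum(wchar_t c) {
    return (c >= L'0' && c <= L'9') || (c >= L'A' && c <= L'Z')
        || (c >= L'a' && c <= L'z');
}

std::wstring RemoveExtraSpacesSketch(const std::wstring& sentence) {
    // Pass 1: collapse each run of half-width spaces into a single space.
    std::wstring collapsed;
    for (wchar_t c : sentence) {
        if (c == L' ' && !collapsed.empty() && collapsed.back() == L' ') continue;
        collapsed.push_back(c);
    }
    // Pass 2: drop a space when its neighbours satisfy one of the conditions.
    std::wstring out;
    for (std::size_t i = 0; i < collapsed.size(); ++i) {
        if (collapsed[i] == L' ' && i > 0 && i + 1 < collapsed.size()) {
            const wchar_t prev = collapsed[i - 1];
            const wchar_t next = collapsed[i + 1];
            const bool both_japanese = IsJapaneseChar(prev) && IsJapaneseChar(next);
            const bool mixed = (IsJapaneseChar(prev) && IsHalfwidthAlnum(next))
                            || (IsHalfwidthAlnum(prev) && IsJapaneseChar(next));
            if (both_japanese || mixed) continue;  // skip this space
        }
        out.push_back(collapsed[i]);
    }
    return out;
}
```

Under these assumptions, RemoveExtraSpacesSketch(L"検索 エンジン") yields L"検索エンジン", while the run of spaces in L"Algorithm   C" is only collapsed to one space, since neither neighbour is a Japanese character.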

◆ TranslateToFullwidth()

wstring TranslateToFullwidth ( const wstring & sentence_w )

Replace a specific string from half-width to full-width.

Replace the half-width forms of the following symbols with the full-width forms listed here
/！”＃＄％＆’（）＊＋，−．／：；＜＞？＠［￥］＾＿｀｛｜｝

Example:

wstring sentence = L"()";

sentence = TranslateToFullwidth(sentence); //L"（）"

Parameters

const wstring& sentence_w: input sentence

Returns: wstring: the processed sentence

Definition at line 91 of file normalizer.cpp.
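
A character-by-character lookup is one simple way to picture this function. The sketch below follows the function name and brief (half-width to full-width); the name TranslateToFullwidthSketch and the table contents are assumptions, and only a handful of symbol pairs are listed rather than the full set used by normalizer.cpp.

```cpp
#include <string>
#include <unordered_map>

std::wstring TranslateToFullwidthSketch(const std::wstring& sentence_w) {
    // Assumed, partial mapping from half-width symbols to full-width forms.
    static const std::unordered_map<wchar_t, wchar_t> table = {
        {L'!', L'！'}, {L'(', L'（'}, {L')', L'）'},
        {L'?', L'？'}, {L'@', L'＠'},
    };
    std::wstring out;
    out.reserve(sentence_w.size());
    for (wchar_t c : sentence_w) {
        const auto it = table.find(c);
        // Characters without an entry in the table pass through unchanged.
        out.push_back(it != table.end() ? it->second : c);
    }
    return out;
}
```

With this table, TranslateToFullwidthSketch(L"(abc)!") returns L"（abc）！"; letters and any symbol not in the table are left as they are.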

◆ UnicodeNormalize()

wstring UnicodeNormalize ( wregex word_pattern, wstring sentence_w )

NFKC-normalize a sentence using icu::Normalizer2.

Searches the sentence for substrings that match the word_pattern regular expression and applies NFKC normalization to them using icu::Normalizer2.

Example:

wstring sentence = L"０１２３４５６７８９";
static wregex word_pattern(L"(([０-９]+))");
wstring normalized_sentence = UnicodeNormalize(word_pattern, sentence);
// normalized_sentence == L"0123456789"

Parameters

wregex word_pattern: regular expression matching the substrings to be normalized
wstring sentence_w: input sentence

Returns: wstring: the normalized sentence

Reference: https://ja.wikipedia.org/wiki/Unicode%E4%B8%80%E8%A6%A7_0000-0FFF

Definition at line 38 of file normalizer.cpp.
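
The match-then-normalize flow described above can be sketched without ICU. In the sketch below, the names NormalizeMatchesSketch and FoldFullwidthAscii are hypothetical, and FoldFullwidthAscii is only a stand-in for icu::Normalizer2: it covers the one NFKC mapping that folds full-width ASCII variants (U+FF01 to U+FF5E) to their ASCII counterparts, nothing else.

```cpp
#include <regex>
#include <string>

// Stand-in for NFKC on a matched span (assumption): fold full-width ASCII
// variants U+FF01..U+FF5E to ASCII by subtracting 0xFEE0.
static std::wstring FoldFullwidthAscii(const std::wstring& s) {
    std::wstring out;
    for (wchar_t c : s) {
        out.push_back((c >= 0xFF01 && c <= 0xFF5E)
                          ? static_cast<wchar_t>(c - 0xFEE0)
                          : c);
    }
    return out;
}

std::wstring NormalizeMatchesSketch(const std::wregex& word_pattern,
                                    const std::wstring& sentence_w) {
    std::wstring out;
    auto pos = sentence_w.cbegin();
    const std::wsregex_iterator end;
    for (std::wsregex_iterator it(sentence_w.cbegin(), sentence_w.cend(),
                                  word_pattern);
         it != end; ++it) {
        out.append(pos, (*it)[0].first);       // copy text before the match
        out += FoldFullwidthAscii(it->str());  // normalize only the match
        pos = (*it)[0].second;                 // continue after the match
    }
    out.append(pos, sentence_w.cend());        // copy the remaining tail
    return out;
}
```

For instance, with the pattern L"[０-９]+" the input L"第１２章ＡＢ" becomes L"第12章ＡＢ": the digits match and are folded, while ＡＢ falls outside the pattern and is left untouched. Restricting normalization to matched spans is what lets a caller choose which character classes get normalized.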
