Corpus Cleaner
CorpusCleaner Class Reference

#include <corpus_cleaner.hpp>

Public Member Functions

 CorpusCleaner (string input_path, string output_path, uint32_t min_length, uint32_t max_length, set< string > accept_language, bool store_rejected, bool sentence_segment, float language_threshold, double perplexity_threshold, GenerateDedupLSH *generate_dedup_lsh, LSHDeduplicator *deduplicator)
 
 ~CorpusCleaner ()
 
void Normalizer (Document &document)
 Normalize sentences using NEologd normalization rules.
 
void URLRemover (Document &document)
 Remove URLs matching regular expression.
 
void SpecialCharacterRemover (Document &document)
 Remove special characters. For example, β˜€, β™‘, β˜†, and so on.
 
void EmojiRemover (Document &document)
 Remove emoji. For example, πŸ€—, πŸ‰, πŸ“Š, and so on.
 
void QuotesRemover (Document &document)
 Remove quotes. For example, [1], {245}, and so on.
 
void LengthFilter (Document &document)
 Remove sentences that are too long or too short.
 
void LanguageFilter (Document &document)
 Language filtering using fastText.
 
void PerplexityFilter (Document &document)
 Quality filtering based on KenLM perplexity.
 
void MinhashDeduplication (Document &document)
 MinHash-LSH deduplication of the files in the this->intermediate folder.
 
void ZeroPunctuationFilter (Document &document)
 Remove sentences without punctuation.
 
void SentenceSegmenter (string input_folder_path, string output_folder_path)
 Simple sentence splitter for Japanese text.
 
Stats PipelineStep (Document &document, void(CorpusCleaner::*cleaner)(Document &))
 Apply a single cleaner step to a document and collect statistics.
 
int32_t CleanPipeline (void)
 Pipeline that sequentially executes the configured CorpusCleaner methods.
 
void StoreException (string function_name, string reference)
 Save exception information to a file.
 

Detailed Description

Definition at line 64 of file corpus_cleaner.hpp.

Constructor & Destructor Documentation

◆ CorpusCleaner()

CorpusCleaner::CorpusCleaner ( string input_path,
string output_path,
uint32_t min_length,
uint32_t max_length,
set< string > accept_language,
bool store_rejected,
bool sentence_segment,
float language_threshold,
double perplexity_threshold,
GenerateDedupLSH * generate_dedup_lsh,
LSHDeduplicator * deduplicator )

Definition at line 239 of file corpus_cleaner.cpp.

◆ ~CorpusCleaner()

CorpusCleaner::~CorpusCleaner ( )

Definition at line 290 of file corpus_cleaner.cpp.

Member Function Documentation

◆ CleanPipeline()

int32_t CorpusCleaner::CleanPipeline ( void )

Pipeline that sequentially executes the configured CorpusCleaner methods.

Perform the following steps in order.

  1. Build pipeline_list, the list of CorpusCleaner methods to be executed (see the Attention section below).
  2. Loop over pipeline_list; for each step:
    2-1. Copy the output folder to the intermediate folder.
    2-2. Get the list of files in the intermediate folder.
    2-3. Execute the CorpusCleaner step on every file in the intermediate folder.

Example:

string input_folder_path = "../../results/dataset/input/";
string output_folder_path = "../../results/dataset/output/";
uint32_t min_length = 5;
uint32_t max_length = 5000;
set<string> accept_language{"__label__ja"};
bool store_rejected = true;
bool execute_sentence_segment = false; // TODO: switch true
float language_threshold = 0.3;
double perplexity_threshold = 40000;
string blacklist_file_path = output_folder_path + "/blacklist.txt";
GenerateDedupLSH generate_dedup_lsh(4, 200, 20, 10);
LSHDeduplicator deduplicator(true, blacklist_file_path, true, 1280000000);
// create instance
CorpusCleaner corpus_cleaner(input_folder_path,
                             output_folder_path,
                             min_length,
                             max_length,
                             accept_language,
                             store_rejected,
                             execute_sentence_segment,
                             language_threshold,
                             perplexity_threshold,
                             &generate_dedup_lsh,
                             &deduplicator);
// Execute cleaning pipeline
corpus_cleaner.CleanPipeline();
Parameters
void: None
Returns
None
Attention
CorpusCleaner processing is performed in the order set in cleaner_list.
For example, set cleaner_list as follows:

vector<void (CorpusCleaner::*)(Document &)> cleaner_list = {
    &CorpusCleaner::URLRemover,
    &CorpusCleaner::LengthFilter,
    &CorpusCleaner::SpecialCharacterRemover,
};

In this case, processing is performed in the order of
  1. URLRemover, 2. LengthFilter, and 3. SpecialCharacterRemover.

Definition at line 751 of file corpus_cleaner.cpp.

◆ EmojiRemover()

void CorpusCleaner::EmojiRemover ( Document & document)

Remove emoji. For example, πŸ€—, πŸ‰, πŸ“Š, and so on.

Remove emoji characters in the range \U0001F300 (πŸŒ€) to \U0001F9FF (🧿).
The C++ regex library does not support 4-byte characters, so characters such as πŸŒ€ cannot be matched with regular expressions.
Instead, the text is scanned exhaustively and exact matches of each emoji are removed.
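
As a stand-alone sketch of the same effect (an illustration only, not the actual implementation, which enumerates and matches the emoji themselves), one can decode UTF-8 code points by hand and drop every code point in the emoji range:

#include <cstdint>
#include <iostream>
#include <string>

// Sketch only: drop code points in [U+1F300, U+1F9FF] from a UTF-8 string.
std::string RemoveEmojiRange(const std::string &text) {
    std::string out;
    size_t i = 0;
    while (i < text.size()) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        size_t len = 1;
        if ((c >> 5) == 0x6) len = 2;        // 110xxxxx
        else if ((c >> 4) == 0xE) len = 3;   // 1110xxxx
        else if ((c >> 3) == 0x1E) len = 4;  // 11110xxx
        if (i + len > text.size()) len = 1;  // malformed tail: copy byte as-is
        uint32_t cp = c;
        if (len == 2) cp = (c & 0x1F) << 6 | (text[i + 1] & 0x3F);
        if (len == 3) cp = (c & 0x0F) << 12 | (text[i + 1] & 0x3F) << 6 | (text[i + 2] & 0x3F);
        if (len == 4) cp = (c & 0x07) << 18 | (text[i + 1] & 0x3F) << 12
                         | (text[i + 2] & 0x3F) << 6 | (text[i + 3] & 0x3F);
        if (cp < 0x1F300 || cp > 0x1F9FF) out.append(text, i, len);  // keep non-emoji
        i += len;
    }
    return out;
}

int main() { std::cout << RemoveEmojiRange("ζ™΄γ‚ŒπŸ‰γ§γ™πŸ“Š") << "\n"; }  // prints "ζ™΄γ‚Œγ§γ™"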

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
https://guppy.eng.kagawa-u.ac.jp/OpenCampus/unicode.html
Definition at line 467 of file corpus_cleaner.cpp.

◆ LanguageFilter()

void CorpusCleaner::LanguageFilter ( Document & document)

Language filtering using fastText.

Example:

string in = "εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚εε‰はまだ無い。";
FastTextEx language_filter;
pair<float, string> score;
score = language_filter.filter(in);
// score.first == 1.00005, score.second == "__label__ja"
string in2 = "I am a cat. No name yet.";
score = language_filter.filter(in2);
// score.first == 0.75237, score.second == "__label__en"
Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t–overlast
https://fasttext.cc/docs/en/supervised-tutorial.html
Definition at line 365 of file corpus_cleaner.cpp.

◆ LengthFilter()

void CorpusCleaner::LengthFilter ( Document & document)

Remove sentences that are too long or too short.

A sentence is rejected as too long if its length is greater than "max_length",
and as too short if its length is less than "min_length".
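
A minimal sketch of the rule (the rejected flag and the use of byte length here are illustrative assumptions, not the class's actual fields):

#include <cstdint>
#include <string>

// Sketch only: reject a line whose length falls outside [min_length, max_length].
void LengthFilterSketch(const std::string &line, uint32_t min_length,
                        uint32_t max_length, bool &rejected) {
    if (line.size() < min_length || line.size() > max_length) rejected = true;
}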

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note

Definition at line 309 of file corpus_cleaner.cpp.

◆ MinhashDeduplication()

void CorpusCleaner::MinhashDeduplication ( Document & document)

MinHash-LSH deduplication of the files in the this->intermediate folder.

Follow the steps below to remove duplication between all lines of all files in the this->intermediate folder.

  1. Get the list of files in this->intermediate_folder and store it in vector<string> file_list.
  2. Compare all lines of source_file and target_file in file_list.
  3. Check for duplication between all lines of source_file and all lines of target_file.
    Deduplication using set or multiset was considered, but it was not used because the file size could exceed the memory capacity.

Example:
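
The original example is empty; as a rough, self-contained illustration of the MinHash-LSH idea only (independent of the actual GenerateDedupLSH / LSHDeduplicator API; the shingle size, hash family, and band sizes below are arbitrary):

#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// MinHash signature over 5-byte shingles of the UTF-8 line.
std::vector<uint64_t> MinHashSignature(const std::string &line, int num_hashes) {
    std::vector<uint64_t> sig(num_hashes, UINT64_MAX);
    std::hash<std::string> h;
    for (size_t i = 0; i + 5 <= line.size(); ++i) {
        uint64_t base = h(line.substr(i, 5));
        for (int k = 0; k < num_hashes; ++k) {
            // Cheap, illustrative family of hash functions derived from one base hash.
            uint64_t v = base ^ (0x9E3779B97F4A7C15ULL * (k + 1));
            if (v < sig[k]) sig[k] = v;
        }
    }
    return sig;
}

// Two lines sharing any LSH band key are treated as near-duplicate candidates.
bool IsDuplicate(const std::string &line, std::unordered_set<std::string> &blacklist,
                 int bands = 4, int rows = 5) {
    std::vector<uint64_t> sig = MinHashSignature(line, bands * rows);
    bool dup = false;
    for (int b = 0; b < bands; ++b) {
        std::string key = std::to_string(b) + ":";
        for (int r = 0; r < rows; ++r) key += std::to_string(sig[b * rows + r]) + ",";
        if (!blacklist.insert(key).second) dup = true;  // band key already seen
    }
    return dup;
}

int main() {
    std::unordered_set<std::string> blacklist;  // stands in for the on-disk blacklist.txt
    std::cout << IsDuplicate("εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚εε‰はまだ無い。", blacklist) << "\n";  // 0
    std::cout << IsDuplicate("εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚εε‰はまだ無い。", blacklist) << "\n";  // 1 (exact duplicate)
}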

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
TODO: fix return stats.

Definition at line 569 of file corpus_cleaner.cpp.

◆ Normalizer()

void CorpusCleaner::Normalizer ( Document & document)

Normalize sentences using NEologd normalization rules.

Please refer to the documentation of "NormalizeNeologd()".

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t–overlast

Definition at line 540 of file corpus_cleaner.cpp.

◆ PerplexityFilter()

void CorpusCleaner::PerplexityFilter ( Document & document)

Quality filtering based on KenLM perplexity.

Please refer to the document of "TODO".

  1. If the perplexity of the document is less than "perplexity_threshold", the document is rejected.

Example:
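
The original example is empty; as a hedged illustration only, the sketch below shows how perplexity is typically derived from per-token log10 probabilities (the form in which KenLM reports scores) and compared against the threshold. The values and names are illustrative, not this class's actual code.

#include <cmath>
#include <iostream>
#include <vector>

// Perplexity from per-token log10 probabilities: PP = 10^(-(1/N) * sum(log10 p_i)).
double PerplexityFromLogProbs(const std::vector<double> &log10_probs) {
    double sum = 0.0;
    for (double lp : log10_probs) sum += lp;
    return std::pow(10.0, -sum / log10_probs.size());
}

int main() {
    // Hypothetical per-token scores for one line; real values would come from the KenLM model.
    std::vector<double> scores = {-1.2, -0.8, -2.5, -1.9};
    double pp = PerplexityFromLogProbs(scores);      // about 39.8
    double perplexity_threshold = 40000;             // value used in the CleanPipeline example
    bool rejected = (pp < perplexity_threshold);     // the rule described above
    std::cout << "perplexity=" << pp << " rejected=" << rejected << "\n";
}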

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note

Definition at line 331 of file corpus_cleaner.cpp.

◆ PipelineStep()

Stats CorpusCleaner::PipelineStep ( Document & document,
void(CorpusCleaner::*)(Document &) cleaner )

Apply a single cleaner step to a document and collect statistics.

Parameters
Document &document: document to be filtered
void (CorpusCleaner::*cleaner)(Document &): pointer to the cleaner member function to apply
Returns
Stats: statistics information of this function.
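
For reference, a cleaner passed this way is dispatched through pointer-to-member syntax; a minimal sketch (the timing and Stats bookkeeping of the real method are omitted):

#include "corpus_cleaner.hpp"

// Sketch: invoking a pointer-to-member cleaner, as PipelineStep receives it.
void RunCleaner(CorpusCleaner &cc, Document &doc,
                void (CorpusCleaner::*cleaner)(Document &)) {
    (cc.*cleaner)(doc);  // e.g. cleaner == &CorpusCleaner::URLRemover
}

Inside the member function itself, the equivalent call is (this->*cleaner)(document).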

Definition at line 673 of file corpus_cleaner.cpp.

◆ QuotesRemover()

void CorpusCleaner::QuotesRemover ( Document & document)

Remove quotes. For example, [1], {245}, and so on.

Remove remarks matching regular expression.
The regular expression is "(\[([0-9]+)\]|\{([0-9]+)\})".

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Attention
Don't use this on corpus that contain formulas or programs.

Definition at line 490 of file corpus_cleaner.cpp.

◆ SentenceSegmenter()

void CorpusCleaner::SentenceSegmenter ( string input_folder_path,
string output_folder_path )

Simple sentence splitter for Japanese text.

I used Pragmatic Segmenter's Japanese rules as a reference for sentence separation rules.

Example: TODO
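
While the example is still TODO, a much-simplified, hypothetical sketch of the idea is shown below: split after 。, !, ? while keeping the delimiter. The real rules (per Pragmatic Segmenter) also handle quotes and parentheses, which this sketch ignores.

#include <iostream>
#include <string>
#include <vector>

// Sketch only: split Japanese UTF-8 text after sentence-ending punctuation.
std::vector<std::string> SplitSentencesSketch(const std::string &text) {
    static const std::vector<std::string> kDelims = {"。", "!", "?", "!", "?"};
    std::vector<std::string> sentences;
    std::string current;
    size_t i = 0;
    while (i < text.size()) {
        bool matched = false;
        for (const std::string &d : kDelims) {
            if (text.compare(i, d.size(), d) == 0) {
                current += d;
                sentences.push_back(current);
                current.clear();
                i += d.size();
                matched = true;
                break;
            }
        }
        if (!matched) { current += text[i]; ++i; }
    }
    if (!current.empty()) sentences.push_back(current);
    return sentences;
}

int main() {
    // Prints two lines: "εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚" and "名前はまだ無い。"
    for (const auto &s : SplitSentencesSketch("εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚εε‰はまだ無い。"))
        std::cout << s << "\n";
}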

Parameters
string input_folder_path: The path of the folder containing the files to be segmented.
string output_folder_path: The output folder path for the result files.
Returns
void: None
Note
https://github.com/wwwcojp/ja_sentence_segmenter/blob/main/ja_sentence_segmenter/split/simple_splitter.py
https://github.com/diasks2/pragmatic_segmenter#golden-rules-japanese
Definition at line 617 of file corpus_cleaner.cpp.

◆ SpecialCharacterRemover()

void CorpusCleaner::SpecialCharacterRemover ( Document & document)

Remove special characters. For example, β˜€, β™‘, β˜†, and so on.

Remove special characters in the ranges \U00002600 (β˜€) to \U000027FF (⟿),
\U00002190 (←) to \U000021FF (β‡Ώ), \U00002300 (βŒ€) to \U000023FF (⏿),
\U00002900 (β€€) to \U0000297F (β₯Ώ), \U00002B00 (⬀) to \U00002BFF (β―Ώ),
and \U0001F000 (πŸ€€) to \U0001F0FF (πŸƒΏ).
The C++ regex library does not support 4-byte characters, so characters such as πŸ€€ cannot be matched with regular expressions.
Instead, the text is scanned exhaustively and exact matches of each character are removed.

Example: TODO.

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
https://guppy.eng.kagawa-u.ac.jp/OpenCampus/unicode.html

Definition at line 435 of file corpus_cleaner.cpp.

◆ StoreException()

void CorpusCleaner::StoreException ( string function_name,
string reference )

Save exception information to a file.

Parameters
string function_name: name of the function that caused the exception
string reference: reference information, for example the sentence being processed
Returns
None
Note

Definition at line 227 of file corpus_cleaner.cpp.

◆ URLRemover()

void CorpusCleaner::URLRemover ( Document & document)

Remove URLs matching regular expression.

Remove URLs matching regular expression.
The regular expression is "(https?|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+\$,%#]+)".
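
Assuming an implementation with std::regex (the documentation above gives only the pattern), a minimal sketch of the removal looks like this; the pattern is quoted in a raw string literal with redundant backslash escapes dropped:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Pattern from the documentation above (ECMAScript syntax).
    std::regex url_re(R"((https?|ftp)(://[-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#]+))");
    std::string line = "see https://example.com/page?id=1 for details";
    std::cout << std::regex_replace(line, url_re, "") << "\n";  // "see  for details"
}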

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note

Definition at line 407 of file corpus_cleaner.cpp.

◆ ZeroPunctuationFilter()

void CorpusCleaner::ZeroPunctuationFilter ( Document & document)

Remove sentences without punctuation.

Remove sentences that contain none of the punctuation marks "、", "ο½€", "。", "ο½‘", ".", ".", "?", "?", "!", "!".

Example:
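
The original example is empty; a small sketch of the check (plain UTF-8 substring search over the punctuation set listed above) might look like this:

#include <iostream>
#include <string>
#include <vector>

// Sketch only: does the line contain at least one of the listed punctuation marks?
bool HasPunctuation(const std::string &line) {
    static const std::vector<std::string> kPunct =
        {"、", "ο½€", "。", "ο½‘", ".", ".", "?", "?", "!", "!"};
    for (const std::string &p : kPunct)
        if (line.find(p) != std::string::npos) return true;
    return false;
}

int main() {
    std::cout << HasPunctuation("https://github.com/") << " "   // 1: contains '.'
              << HasPunctuation("γγ‚“γ«γ‘γ―") << "\n";            // 0: no punctuation
}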

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
This filter is heuristic.
For example, a sentence such as "https://github.com/" is not removed because it contains '.'.

Definition at line 512 of file corpus_cleaner.cpp.


The documentation for this class was generated from the following files:
corpus_cleaner.hpp
corpus_cleaner.cpp