Corpus Cleaner
CorpusCleaner Class Reference

#include <corpus_cleaner.hpp>

Public Member Functions

 CorpusCleaner (string input_path, string output_path, uint32_t min_length, uint32_t max_length, set< string > accept_language, bool store_rejected, bool sentence_segment, float language_threshold, double perplexity_threshold, GenerateDedupLSH *generate_dedup_lsh, LSHDeduplicator *deduplicator)
 
 ~CorpusCleaner ()
 
void Normalizer (Document &document)
 Normalize sentences using NEologd normalization rules.
 
void URLRemover (Document &document)
 Remove URLs matching regular expression.
 
void SpecialCharacterRemover (Document &document)
 Remove special characters. For example, β˜€, β™‘, β˜†, and so on.
 
void EmojiRemover (Document &document)
 Remove emoji. For example, πŸ€—, πŸ‰, πŸ“Š, and so on.
 
void QuotesRemover (Document &document)
 Remove quotes. For example, [1], {245}, and so on.
 
void LengthFilter (Document &document)
 Remove sentences that are too long or too short.
 
void LanguageFilter (Document &document)
 Language filtering using fastText.
 
void PerplexityFilter (Document &document)
 Quality filtering based on KenLM perplexity.
 
void MinhashDeduplication (Document &document)
 MinHash-LSH deduplication of the files in the this->intermediate folder.
 
void ZeroPunctuationFilter (Document &document)
 Remove sentences without punctuation.
 
void SentenceSegmenter (string input_folder_path, string output_folder_path)
 Simple sentence splitter for Japanese text.
 
Stats PipelineStep (Document &document, void(CorpusCleaner::*cleaner)(Document &))
 Apply a single cleaner step to a document and collect statistics.
 
int32_t CleanPipeline (void)
 Pipeline that sequentially executes the configured CorpusCleaner methods.
 
void StoreException (string function_name, string reference)
 Save exception information to a file.
 

Detailed Description

Definition at line 64 of file corpus_cleaner.hpp.

Constructor & Destructor Documentation

◆ CorpusCleaner()

CorpusCleaner::CorpusCleaner ( string input_path,
string output_path,
uint32_t min_length,
uint32_t max_length,
set< string > accept_language,
bool store_rejected,
bool sentence_segment,
float language_threshold,
double perplexity_threshold,
GenerateDedupLSH * generate_dedup_lsh,
LSHDeduplicator * deduplicator )

Definition at line 239 of file corpus_cleaner.cpp.

◆ ~CorpusCleaner()

CorpusCleaner::~CorpusCleaner ( )

Definition at line 290 of file corpus_cleaner.cpp.

Member Function Documentation

◆ CleanPipeline()

int32_t CorpusCleaner::CleanPipeline ( void )

Pipeline that sequentially executes the configured CorpusCleaner methods.

Perform the following steps in order.

  1. Build pipeline_list, the list of CorpusCleaner methods to be executed (see the Attention section below).
  2. Loop over pipeline_list; for each step:
    2-1. Copy the output folder to the intermediate folder.
    2-2. Get the list of files in the intermediate folder.
    2-3. Execute the CorpusCleaner step on every file in the intermediate folder.

Example:

string input_folder_path = "../../results/dataset/input/";
string output_folder_path = "../../results/dataset/output/";
uint32_t min_length = 5;
uint32_t max_length = 5000;
set<string> accept_language{"__label__ja"};
bool store_rejected = true;
bool execute_sentence_segment = false; // TODO: switch true
float language_threshold = 0.3;
double perplexity_threshold = 40000;
string blacklist_file_path = output_folder_path + "/blacklist.txt";
GenerateDedupLSH generate_dedup_lsh(4, 200, 20, 10);
LSHDeduplicator deduplicator(true, blacklist_file_path, true, 1280000000);
// create instance
CorpusCleaner corpus_cleaner(input_folder_path,
                             output_folder_path,
                             min_length,
                             max_length,
                             accept_language,
                             store_rejected,
                             execute_sentence_segment,
                             language_threshold,
                             perplexity_threshold,
                             &generate_dedup_lsh,
                             &deduplicator);
// Execute cleaning pipeline
corpus_cleaner.CleanPipeline();
Parameters
void: None
Returns
None
Attention
CorpusCleaner processing is performed in the order set in cleaner_list.
For example, set cleaner_list as follows:

vector<void (CorpusCleaner::*)(Document &)> cleaner_list = {
    &CorpusCleaner::URLRemover,
    &CorpusCleaner::LengthFilter,
    &CorpusCleaner::SpecialCharacterRemover,
};

In this case, processing is performed in the order of
  1. URLRemover, 2. LengthFilter, and 3. SpecialCharacterRemover.

Definition at line 751 of file corpus_cleaner.cpp.

◆ EmojiRemover()

void CorpusCleaner::EmojiRemover ( Document & document)

Remove emoji. For example, πŸ€—, πŸ‰, πŸ“Š, and so on.

Remove emoji characters in the range \U0001F300 (πŸŒ€) to \U0001F9FF (🧿).
The C++ regex library does not support 4-byte characters, so characters such as πŸŒ€ cannot be matched with regular expressions.
Instead, the text is scanned exhaustively and exact matches of each emoji are removed.
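
As a stand-alone sketch of the same effect (an illustration only, not the actual implementation, which enumerates and matches the emoji themselves), one can decode UTF-8 code points by hand and drop every code point in the emoji range:

#include <cstdint>
#include <iostream>
#include <string>

// Sketch only: drop code points in [U+1F300, U+1F9FF] from a UTF-8 string.
std::string RemoveEmojiRange(const std::string &text) {
    std::string out;
    size_t i = 0;
    while (i < text.size()) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        size_t len = 1;
        if ((c >> 5) == 0x6) len = 2;        // 110xxxxx
        else if ((c >> 4) == 0xE) len = 3;   // 1110xxxx
        else if ((c >> 3) == 0x1E) len = 4;  // 11110xxx
        if (i + len > text.size()) len = 1;  // malformed tail: copy byte as-is
        uint32_t cp = c;
        if (len == 2) cp = (c & 0x1F) << 6 | (text[i + 1] & 0x3F);
        if (len == 3) cp = (c & 0x0F) << 12 | (text[i + 1] & 0x3F) << 6 | (text[i + 2] & 0x3F);
        if (len == 4) cp = (c & 0x07) << 18 | (text[i + 1] & 0x3F) << 12
                         | (text[i + 2] & 0x3F) << 6 | (text[i + 3] & 0x3F);
        if (cp < 0x1F300 || cp > 0x1F9FF) out.append(text, i, len);  // keep non-emoji
        i += len;
    }
    return out;
}

int main() { std::cout << RemoveEmojiRange("ζ™΄γ‚ŒπŸ‰γ§γ™πŸ“Š") << "\n"; }  // prints "ζ™΄γ‚Œγ§γ™"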

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
https://guppy.eng.kagawa-u.ac.jp/OpenCampus/unicode.html
Definition at line 467 of file corpus_cleaner.cpp.

◆ LanguageFilter()

void CorpusCleaner::LanguageFilter ( Document & document)

Language filtering using fastText.

Example:

string in = "εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚εε‰はまだ無い。";
FastTextEx language_filter;
pair<float, string> score;
score = language_filter.filter(in);
// score.first == 1.00005, score.second == "__label__ja"
string in2 = "I am a cat. No name yet.";
score = language_filter.filter(in2);
// score.first == 0.75237, score.second == "__label__en"
Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t–overlast
https://fasttext.cc/docs/en/supervised-tutorial.html
Definition at line 365 of file corpus_cleaner.cpp.

◆ LengthFilter()

void CorpusCleaner::LengthFilter ( Document & document)

Remove sentences that are too long or too short.

A sentence is rejected as too long if its length is greater than "max_length",
and as too short if its length is less than "min_length".
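
A minimal sketch of the rule (the rejected flag and the use of byte length here are illustrative assumptions, not the class's actual fields):

#include <cstdint>
#include <string>

// Sketch only: reject a line whose length falls outside [min_length, max_length].
void LengthFilterSketch(const std::string &line, uint32_t min_length,
                        uint32_t max_length, bool &rejected) {
    if (line.size() < min_length || line.size() > max_length) rejected = true;
}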

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note

Definition at line 309 of file corpus_cleaner.cpp.

◆ MinhashDeduplication()

void CorpusCleaner::MinhashDeduplication ( Document & document)

MinHash-LSH deduplication of the files in the this->intermediate folder.

Follow the steps below to remove duplication between all lines of all files in the this->intermediate folder.

  1. Get the list of files in this->intermediate_folder and store it in vector<string> file_list.
  2. Compare all lines of source_file and target_file in file_list.
  3. Check for duplication between all lines of source_file and all lines of target_file.
    Deduplication using set or multiset was considered, but it was not used because the file size could exceed the memory capacity.

Example:
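
The original example is empty; as a rough, self-contained illustration of the MinHash-LSH idea only (independent of the actual GenerateDedupLSH / LSHDeduplicator API; the shingle size, hash family, and band sizes below are arbitrary):

#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// MinHash signature over 5-byte shingles of the UTF-8 line.
std::vector<uint64_t> MinHashSignature(const std::string &line, int num_hashes) {
    std::vector<uint64_t> sig(num_hashes, UINT64_MAX);
    std::hash<std::string> h;
    for (size_t i = 0; i + 5 <= line.size(); ++i) {
        uint64_t base = h(line.substr(i, 5));
        for (int k = 0; k < num_hashes; ++k) {
            // Cheap, illustrative family of hash functions derived from one base hash.
            uint64_t v = base ^ (0x9E3779B97F4A7C15ULL * (k + 1));
            if (v < sig[k]) sig[k] = v;
        }
    }
    return sig;
}

// Two lines sharing any LSH band key are treated as near-duplicate candidates.
bool IsDuplicate(const std::string &line, std::unordered_set<std::string> &blacklist,
                 int bands = 4, int rows = 5) {
    std::vector<uint64_t> sig = MinHashSignature(line, bands * rows);
    bool dup = false;
    for (int b = 0; b < bands; ++b) {
        std::string key = std::to_string(b) + ":";
        for (int r = 0; r < rows; ++r) key += std::to_string(sig[b * rows + r]) + ",";
        if (!blacklist.insert(key).second) dup = true;  // band key already seen
    }
    return dup;
}

int main() {
    std::unordered_set<std::string> blacklist;  // stands in for the on-disk blacklist.txt
    std::cout << IsDuplicate("εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚εε‰はまだ無い。", blacklist) << "\n";  // 0
    std::cout << IsDuplicate("εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚εε‰はまだ無い。", blacklist) << "\n";  // 1 (exact duplicate)
}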

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
TODO: fix return stats.

Definition at line 569 of file corpus_cleaner.cpp.

◆ Normalizer()

void CorpusCleaner::Normalizer ( Document & document)

Normalize sentences using NEologd normalization rules.

Please refer to the documentation of "NormalizeNeologd()".

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t–overlast

Definition at line 540 of file corpus_cleaner.cpp.

◆ PerplexityFilter()

void CorpusCleaner::PerplexityFilter ( Document & document)

Quality filtering based on KenLM perplexity.

Please refer to the document of "TODO".

  1. If the perplexity of the document is less than "perplexity_threshold", the document is rejected.

Example:
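
The original example is empty; as a hedged illustration only, the sketch below shows how perplexity is typically derived from per-token log10 probabilities (the form in which KenLM reports scores) and compared against the threshold. The values and names are illustrative, not this class's actual code.

#include <cmath>
#include <iostream>
#include <vector>

// Perplexity from per-token log10 probabilities: PP = 10^(-(1/N) * sum(log10 p_i)).
double PerplexityFromLogProbs(const std::vector<double> &log10_probs) {
    double sum = 0.0;
    for (double lp : log10_probs) sum += lp;
    return std::pow(10.0, -sum / log10_probs.size());
}

int main() {
    // Hypothetical per-token scores for one line; real values would come from the KenLM model.
    std::vector<double> scores = {-1.2, -0.8, -2.5, -1.9};
    double pp = PerplexityFromLogProbs(scores);      // about 39.8
    double perplexity_threshold = 40000;             // value used in the CleanPipeline example
    bool rejected = (pp < perplexity_threshold);     // the rule described above
    std::cout << "perplexity=" << pp << " rejected=" << rejected << "\n";
}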

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note

Definition at line 331 of file corpus_cleaner.cpp.

◆ PipelineStep()

Stats CorpusCleaner::PipelineStep ( Document & document,
void(CorpusCleaner::*)(Document &) cleaner )

Apply a single cleaner step to a document and collect statistics.

Parameters
Document &document: document to be filtered
void (CorpusCleaner::*cleaner)(Document &): pointer to the cleaner member function to apply
Returns
Stats: statistics information of this function.
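
For reference, a cleaner passed this way is dispatched through pointer-to-member syntax; a minimal sketch (the timing and Stats bookkeeping of the real method are omitted):

#include "corpus_cleaner.hpp"

// Sketch: invoking a pointer-to-member cleaner, as PipelineStep receives it.
void RunCleaner(CorpusCleaner &cc, Document &doc,
                void (CorpusCleaner::*cleaner)(Document &)) {
    (cc.*cleaner)(doc);  // e.g. cleaner == &CorpusCleaner::URLRemover
}

Inside the member function itself, the equivalent call is (this->*cleaner)(document).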

Definition at line 673 of file corpus_cleaner.cpp.

◆ QuotesRemover()

void CorpusCleaner::QuotesRemover ( Document & document)

Remove quotes. For example, [1], {245}, and so on.

Remove remarks matching regular expression.
The regular expression is "(\[([0-9]+)\]|\{([0-9]+)\})".

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Attention
Don't use this on corpus that contain formulas or programs.

Definition at line 490 of file corpus_cleaner.cpp.

◆ SentenceSegmenter()

void CorpusCleaner::SentenceSegmenter ( string input_folder_path,
string output_folder_path )

Simple sentence splitter for Japanese text.

I used Pragmatic Segmenter's Japanese rules as a reference for sentence separation rules.

Example: TODO
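
While the example is still TODO, a much-simplified, hypothetical sketch of the idea is shown below: split after 。, !, ? while keeping the delimiter. The real rules (per Pragmatic Segmenter) also handle quotes and parentheses, which this sketch ignores.

#include <iostream>
#include <string>
#include <vector>

// Sketch only: split Japanese UTF-8 text after sentence-ending punctuation.
std::vector<std::string> SplitSentencesSketch(const std::string &text) {
    static const std::vector<std::string> kDelims = {"。", "!", "?", "!", "?"};
    std::vector<std::string> sentences;
    std::string current;
    size_t i = 0;
    while (i < text.size()) {
        bool matched = false;
        for (const std::string &d : kDelims) {
            if (text.compare(i, d.size(), d) == 0) {
                current += d;
                sentences.push_back(current);
                current.clear();
                i += d.size();
                matched = true;
                break;
            }
        }
        if (!matched) { current += text[i]; ++i; }
    }
    if (!current.empty()) sentences.push_back(current);
    return sentences;
}

int main() {
    // Prints two lines: "εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚" and "名前はまだ無い。"
    for (const auto &s : SplitSentencesSketch("εΎθΌ©γ―ηŒ«γ§γ‚γ‚‹γ€‚εε‰はまだ無い。"))
        std::cout << s << "\n";
}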

Parameters
string input_folder_path: The path of the folder containing the files to be segmented.
string output_folder_path: The output folder path for the result files.
Returns
void: None
Note
https://github.com/wwwcojp/ja_sentence_segmenter/blob/main/ja_sentence_segmenter/split/simple_splitter.py
https://github.com/diasks2/pragmatic_segmenter#golden-rules-japanese
Definition at line 617 of file corpus_cleaner.cpp.

◆ SpecialCharacterRemover()

void CorpusCleaner::SpecialCharacterRemover ( Document & document)

Remove special characters. For example, β˜€, β™‘, β˜†, and so on.

Remove special characters in the ranges \U00002600 (β˜€) to \U000027FF (⟿),
\U00002190 (←) to \U000021FF (β‡Ώ), \U00002300 (βŒ€) to \U000023FF (⏿),
\U00002900 (β€€) to \U0000297F (β₯Ώ), \U00002B00 (⬀) to \U00002BFF (β―Ώ),
and \U0001F000 (πŸ€€) to \U0001F0FF (πŸƒΏ).
The C++ regex library does not support 4-byte characters, so characters such as πŸ€€ cannot be matched with regular expressions.
Instead, the text is scanned exhaustively and exact matches of each character are removed.

Example: TODO.

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
https://guppy.eng.kagawa-u.ac.jp/OpenCampus/unicode.html

Definition at line 435 of file corpus_cleaner.cpp.

◆ StoreException()

void CorpusCleaner::StoreException ( string function_name,
string reference )

Save exception information to a file.

Parameters
string function_name: name of the function that caused the exception
string reference: reference information, for example the sentence being processed
Returns
None
Note

Definition at line 227 of file corpus_cleaner.cpp.

◆ URLRemover()

void CorpusCleaner::URLRemover ( Document & document)

Remove URLs matching regular expression.

Remove URLs matching regular expression.
The regular expression is "(https?|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+\$,%#]+)".
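
Assuming an implementation with std::regex (the documentation above gives only the pattern), a minimal sketch of the removal looks like this; the pattern is quoted in a raw string literal with redundant backslash escapes dropped:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Pattern from the documentation above (ECMAScript syntax).
    std::regex url_re(R"((https?|ftp)(://[-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#]+))");
    std::string line = "see https://example.com/page?id=1 for details";
    std::cout << std::regex_replace(line, url_re, "") << "\n";  // "see  for details"
}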

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note

Definition at line 407 of file corpus_cleaner.cpp.

◆ ZeroPunctuationFilter()

void CorpusCleaner::ZeroPunctuationFilter ( Document & document)

Remove sentences without punctuation.

Remove sentences that contain none of the punctuation marks "、", "ο½€", "。", "ο½‘", ".", ".", "?", "?", "!", "!".

Example:
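
The original example is empty; a small sketch of the check (plain UTF-8 substring search over the punctuation set listed above) might look like this:

#include <iostream>
#include <string>
#include <vector>

// Sketch only: does the line contain at least one of the listed punctuation marks?
bool HasPunctuation(const std::string &line) {
    static const std::vector<std::string> kPunct =
        {"、", "ο½€", "。", "ο½‘", ".", ".", "?", "?", "!", "!"};
    for (const std::string &p : kPunct)
        if (line.find(p) != std::string::npos) return true;
    return false;
}

int main() {
    std::cout << HasPunctuation("https://github.com/") << " "   // 1: contains '.'
              << HasPunctuation("γγ‚“γ«γ‘γ―") << "\n";            // 0: no punctuation
}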

Parameters
Document &document: single line text to be cleaned
Returns
void: None
Note
This filter is heuristic.
For example, a sentence such as "https://github.com/" is not removed because it contains '.'.

Definition at line 512 of file corpus_cleaner.cpp.


The documentation for this class was generated from the following files:
corpus_cleaner.hpp
corpus_cleaner.cpp