Corpus Cleaner
#include <corpus_cleaner.hpp>
Public Member Functions | |
| CorpusCleaner (string input_path, string output_path, uint32_t min_length, uint32_t max_length, set< string > accept_language, bool store_rejected, bool sentence_segment, float language_threshold, double perplexity_threshold, GenerateDedupLSH *generate_dedup_lsh, LSHDeduplicator *deduplicator) | |
| ~CorpusCleaner () | |
| void | Normalizer (Document &document) |
| Normalize a sentence with Neologd normalization. | |
| void | URLRemover (Document &document) |
| Remove URLs matching regular expression. | |
| void | SpecialCharacterRemover (Document &document) |
| Remove special characters. For example, ☀, ←, ⌀, and so on. | |
| void | EmojiRemover (Document &document) |
| Remove emoji. For example, 🌀, 🧿, and so on. | |
| void | QuotesRemover (Document &document) |
| Remove quotes. For example, [1], {245}, and so on. | |
| void | LengthFilter (Document &document) |
| Remove sentences that are too long or too short. | |
| void | LanguageFilter (Document &document) |
| Language filtering using fastText. | |
| void | PerplexityFilter (Document &document) |
| KenLM's Perplexity Quality filtering. | |
| void | MinhashDeduplication (Document &document) |
| Deduplicate the files in the this->intermediate folder using MinHash LSH. | |
| void | ZeroPunctuationFilter (Document &document) |
| Remove sentences that contain no punctuation. | |
| void | SentenceSegmenter (string input_folder_path, string output_folder_path) |
| Simple sentence splitter for Japanese text. | |
| Stats | PipelineStep (Document &document, void(CorpusCleaner::*cleaner)(Document &)) |
| PipelineStep. | |
| int32_t | CleanPipeline (void) |
| Pipeline that sequentially executes the configured CorpusCleaner methods. | |
| void | StoreException (string function_name, string reference) |
| Save exception in file. | |
Definition at line 64 of file corpus_cleaner.hpp.
| CorpusCleaner::CorpusCleaner | ( | string | input_path, |
| string | output_path, | ||
| uint32_t | min_length, | ||
| uint32_t | max_length, | ||
| set< string > | accept_language, | ||
| bool | store_rejected, | ||
| bool | sentence_segment, | ||
| float | language_threshold, | ||
| double | perplexity_threshold, | ||
| GenerateDedupLSH * | generate_dedup_lsh, | ||
| LSHDeduplicator * | deduplicator ) |
Definition at line 239 of file corpus_cleaner.cpp.
| CorpusCleaner::~CorpusCleaner | ( | ) |
Definition at line 290 of file corpus_cleaner.cpp.
| int32_t CorpusCleaner::CleanPipeline | ( | void | ) |
Pipeline that sequentially executes the configured CorpusCleaner methods.
Perform the following steps in order.
Example:
| void | None |
Definition at line 751 of file corpus_cleaner.cpp.
| void CorpusCleaner::EmojiRemover | ( | Document & | document | ) |
Remove emoji. For example, 🌀, 🧿, and so on.
Remove emoji characters in the range \U0001F300(🌀) to \U0001F9FF(🧿).
The C++ regex library does not support 4-byte characters.
Therefore, 4-byte characters such as emoji cannot be matched using regular expressions.
Instead, the text is searched exhaustively, and every character that exactly matches a pictogram in this range is removed.
| Document | &document: single line text to be cleaned |
Definition at line 467 of file corpus_cleaner.cpp.
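The exhaustive search described above can be sketched as follows. This is a minimal illustration, not the code in corpus_cleaner.cpp: the helper names `EncodeUtf8FourBytes`, `RemoveCodePointRange`, and `RemoveEmoji` are hypothetical, and the input is assumed to be well-formed UTF-8. Each code point in the range is encoded to its 4-byte UTF-8 form and every exact occurrence is erased.

```cpp
#include <string>

// Encode a supplementary-plane code point (U+10000..U+10FFFF) as 4-byte UTF-8.
std::string EncodeUtf8FourBytes(char32_t cp) {
    std::string out;
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return out;
}

// Brute force: try every code point in [first, last] and erase each exact match.
// A 4-byte UTF-8 sequence can only match at a character boundary, so a plain
// byte-level find is sufficient for well-formed input.
std::string RemoveCodePointRange(std::string text, char32_t first, char32_t last) {
    for (char32_t cp = first; cp <= last; ++cp) {
        const std::string needle = EncodeUtf8FourBytes(cp);
        std::string::size_type pos = 0;
        while ((pos = text.find(needle, pos)) != std::string::npos) {
            text.erase(pos, needle.size());
        }
    }
    return text;
}

// Hypothetical wrapper using the range documented for EmojiRemover.
std::string RemoveEmoji(const std::string &text) {
    return RemoveCodePointRange(text, 0x1F300, 0x1F9FF);
}
```

The loop visits 1,792 code points per call, which is simple but not fast; decoding the text once and testing each code point against the range would be a cheaper alternative.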
| void CorpusCleaner::LanguageFilter | ( | Document & | document | ) |
Language filtering using fastText.
| Document | &document: single line text to be cleaned |
Definition at line 365 of file corpus_cleaner.cpp.
| void CorpusCleaner::LengthFilter | ( | Document & | document | ) |
Remove sentences that are too long or too short.
A sentence is too long if its length is more than "max_length".
A sentence is too short if its length is less than "min_length".
| Document | &document: single line text to be cleaned |
Definition at line 309 of file corpus_cleaner.cpp.
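The length check can be sketched as below. The source does not state how length is measured, so this sketch assumes it is counted in UTF-8 code points rather than bytes; `CountCodePoints` and `PassesLengthFilter` are hypothetical names, not part of the CorpusCleaner API.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Count UTF-8 code points by counting every byte that is not a continuation byte.
// Assumption: the text is well-formed UTF-8.
std::size_t CountCodePoints(const std::string &text) {
    std::size_t count = 0;
    for (unsigned char c : text) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// A sentence is kept when min_length <= length <= max_length.
bool PassesLengthFilter(const std::string &text,
                        std::uint32_t min_length,
                        std::uint32_t max_length) {
    const std::size_t length = CountCodePoints(text);
    return length >= min_length && length <= max_length;
}
```

Both bounds are inclusive here, matching the wording above: only lengths strictly greater than "max_length" or strictly less than "min_length" are rejected.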
| void CorpusCleaner::MinhashDeduplication | ( | Document & | document | ) |
Deduplicate the files in the this->intermediate folder using MinHash LSH.
Follow the steps below to remove duplication between all lines of all files in the this->intermediate folder.
Example:
| Document | &document: single line text to be cleaned |
Definition at line 569 of file corpus_cleaner.cpp.
| void CorpusCleaner::Normalizer | ( | Document & | document | ) |
Normalize a sentence with Neologd normalization.
Please refer to the documentation of "NormalizeNeologd()".
| Document | &document: single line text to be cleaned |
Definition at line 540 of file corpus_cleaner.cpp.
| void CorpusCleaner::PerplexityFilter | ( | Document & | document | ) |
KenLM's Perplexity Quality filtering.
Please refer to the documentation of "TODO".
Example:
| Document | &document: single line text to be cleaned |
Definition at line 331 of file corpus_cleaner.cpp.
| Stats CorpusCleaner::PipelineStep | ( | Document & | document, |
| void(CorpusCleaner::*)(Document &) | cleaner ) |
PipelineStep.
| Document | &document: document to be filtered |
| void | (CorpusCleaner::*cleaner)(Document &): pointer to the cleaner member function to apply |
Definition at line 673 of file corpus_cleaner.cpp.
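The second parameter is a pointer to a member function, which lets one routine apply any cleaner with the signature void(Document &). The sketch below shows only the calling convention. `MiniCleaner`, its two cleaners, and the reduced `Document` struct are hypothetical stand-ins, and the real PipelineStep returns a Stats value that is not modeled here.

```cpp
#include <string>

// Hypothetical, reduced stand-in for the library's Document type.
struct Document {
    std::string text;
    bool is_rejected;
};

class MiniCleaner {
public:
    // Example cleaner: strip trailing spaces.
    void TrimTrailingSpaces(Document &document) {
        while (!document.text.empty() && document.text.back() == ' ') {
            document.text.pop_back();
        }
    }

    // Example filter: reject empty documents.
    void RejectEmpty(Document &document) {
        if (document.text.empty()) {
            document.is_rejected = true;
        }
    }

    // Apply one cleaner through a pointer to member function.
    void Step(Document &document, void (MiniCleaner::*cleaner)(Document &)) {
        (this->*cleaner)(document);
    }
};
```

A caller passes the cleaner by address, for example `cleaner.Step(document, &MiniCleaner::TrimTrailingSpaces);`, so a pipeline can keep its steps in a list and run them in order.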
| void CorpusCleaner::QuotesRemover | ( | Document & | document | ) |
Remove quotes. For example, [1], {245}, and so on.
Remove citation markers matching the regular expression.
The regular expression is "(\[([0-9]+)\]|\{([0-9]+)\})".
| Document | &document: single line text to be cleaned |
Definition at line 490 of file corpus_cleaner.cpp.
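As a rough illustration of the pattern above, the following sketch applies it with std::regex_replace. The function name `RemoveCitationMarkers` is hypothetical and it works on a plain string rather than a Document.

```cpp
#include <regex>
#include <string>

// Remove bracketed or braced numeric markers such as [1] or {245}.
std::string RemoveCitationMarkers(const std::string &text) {
    static const std::regex marker_pattern(
        R"re((\[([0-9]+)\]|\{([0-9]+)\}))re");
    return std::regex_replace(text, marker_pattern, "");
}
```

Only all-digit contents are matched, so text such as `array[i]` is left unchanged.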
| void CorpusCleaner::SentenceSegmenter | ( | string | input_folder_path, |
| string | output_folder_path ) |
Simple sentence splitter for Japanese text.
The sentence separation rules are based on Pragmatic Segmenter's Japanese rules.
The C++ regex library does not support 4-byte characters.
Therefore, 4-byte characters such as emoji cannot be matched using regular expressions.
Such characters are instead found by an exhaustive search for exact matches.
Example: TODO
| string | input_folder_path: The path of the folder containing the filtered files. |
| string | output_folder_path: The path of the folder for the result files. |
Definition at line 617 of file corpus_cleaner.cpp.
| void CorpusCleaner::SpecialCharacterRemover | ( | Document & | document | ) |
Remove special characters. For example, ☀, ←, ⌀, and so on.
Remove special characters in the ranges \U00002600(☀) to \U000027ff(⟿),
\U00002190(←) to \U000021ff(⇿), \U00002300(⌀) to \U000023ff(⏿),
\U00002900(⤀) to \U0000297f(⥿), \U00002b00(⬀) to \U00002bff(⯿),
and \U0001f000(🀀) to \U0001f0ff.
The C++ regex library does not support 4-byte characters.
Therefore, 4-byte characters such as emoji cannot be matched using regular expressions.
Instead, the text is searched exhaustively, and every character that exactly matches one in these ranges is removed.
Example: TODO.
| Document | &document: single line text to be cleaned |
Definition at line 435 of file corpus_cleaner.cpp.
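One way to handle several code point ranges without regular expressions is to decode the text once and drop any code point that falls in a listed range. This is an alternative to the exhaustive search described above, not the library's implementation; `CodePointRange`, `IsSpecial`, and `RemoveSpecialCharacters` are hypothetical names, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

struct CodePointRange {
    char32_t first;
    char32_t last;
};

// The ranges listed in the description of SpecialCharacterRemover.
const CodePointRange kSpecialRanges[] = {
    {0x2600, 0x27FF}, {0x2190, 0x21FF}, {0x2300, 0x23FF},
    {0x2900, 0x297F}, {0x2B00, 0x2BFF}, {0x1F000, 0x1F0FF},
};

bool IsSpecial(char32_t cp) {
    for (const CodePointRange &range : kSpecialRanges) {
        if (cp >= range.first && cp <= range.last) {
            return true;
        }
    }
    return false;
}

// Decode each UTF-8 sequence, keep it unless its code point is in a special range.
std::string RemoveSpecialCharacters(const std::string &text) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if (lead >= 0xF0) { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        if (i + len > text.size()) {
            out.append(text, i, std::string::npos);  // truncated tail: copy as-is
            break;
        }
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3F);
        }
        if (!IsSpecial(cp)) {
            out.append(text, i, len);
        }
        i += len;
    }
    return out;
}
```

A single pass over the text handles all six ranges, including the 4-byte range starting at U+1F000.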
| void CorpusCleaner::StoreException | ( | string | function_name, |
| string | reference ) |
Save exception in file.
| string | reference: reference information. For example, the sentence. |
| string | function_name: name of the function that caused the exception |
Definition at line 227 of file corpus_cleaner.cpp.
| void CorpusCleaner::URLRemover | ( | Document & | document | ) |
Remove URLs matching regular expression.
The regular expression is "(https?|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+\$,%#]+)".
| Document | &document: single line text to be cleaned |
Definition at line 407 of file corpus_cleaner.cpp.
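A minimal sketch of applying the pattern above with std::regex_replace follows. The reduced `Document` struct and the free function `RemoveUrls` are hypothetical stand-ins for the library's types, used only to show how the expression behaves.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for the library's Document type; only the text is modeled.
struct Document {
    std::string text;
};

// Erase every substring matched by the documented URL pattern.
void RemoveUrls(Document &document) {
    static const std::regex url_pattern(
        R"re((https?|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+\$,%#]+))re");
    document.text = std::regex_replace(document.text, url_pattern, "");
}
```

Whitespace around a removed URL is kept, so "See https://example.com/a?b=1 for details" becomes "See  for details" with two spaces.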
| void CorpusCleaner::ZeroPunctuationFilter | ( | Document & | document | ) |
Remove sentences that contain no punctuation.
A sentence is removed if it contains none of "、","､","。","｡","．",".","？","?","！","!".
Example:
| Document | &document: single line text to be cleaned |
Definition at line 512 of file corpus_cleaner.cpp.
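The check can be sketched as a substring search over a fixed set of marks. The set used below (Japanese commas and full stops in full-width and half-width forms, plus full-width and ASCII period, question mark, and exclamation mark) is an assumption about the intended list, and `ContainsPunctuation` is a hypothetical helper, not a CorpusCleaner method. The source file is assumed to be saved as UTF-8.

```cpp
#include <string>
#include <vector>

// Return true if the text contains at least one of the assumed punctuation marks.
bool ContainsPunctuation(const std::string &text) {
    static const std::vector<std::string> kPunctuation = {
        "、", "､", "。", "｡", "．", ".", "？", "?", "！", "!",
    };
    for (const std::string &mark : kPunctuation) {
        if (text.find(mark) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

A filter built on this helper would reject a document when `ContainsPunctuation` returns false.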