# Corpus Cleaner
This tool assumes the following folder structure.
Please place the dataset files in "/results/dataset/original/".
Example:
The contents of each numbered folder are as follows.
After running this tool, the files in the "results/dataset/original" folder are divided into multiple parts (in the example below, 8 parts numbered 0 to 7), and each part is copied to the corresponding "dataset/i/input" folder.
The breakdown of the numbered folders is as follows.
First, the contents of the input folder are copied to the intermediate folder.
Note that if a cleaned file already exists, processing after the constructor of the CorpusCleaner class will not proceed, to prevent the existing file from being overwritten.
The following warning appears on the console; move the relevant file to another directory or delete it, and then run the process again.
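The overwrite guard described above amounts to a simple existence check before cleaning starts. The sketch below is an illustration only (the function name and warning text are assumptions, not the tool's actual code):

```cpp
#include <filesystem>
#include <iostream>

// Hypothetical sketch of the overwrite guard: if a cleaned output file
// already exists, warn on the console and report that processing should
// be skipped rather than overwrite the file.
bool ShouldSkipCleaning(const std::filesystem::path& cleaned_file) {
    if (std::filesystem::exists(cleaned_file)) {
        std::cerr << "Warning: " << cleaned_file
                  << " already exists. Move or delete it, then rerun."
                  << std::endl;
        return true;
    }
    return false;
}
```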
This tool takes .txt files as input. Each .txt file is expected to contain one data entry per line.
Each line has the following format.
This tool outputs the cleaned corpus in JSONL format.
Each line of the output has the following format.
Example:
In this chapter, I explain the corpus cleaning features provided by the CorpusCleaner class.
TODO: write about
TODO: write about memory resource usage
For usage examples of each function, please refer to the API specifications.
| Features | Details | Remarks |
|---|---|---|
| Normalizer | Performs the normalization process described in the link below. Neologdn's normalization includes, for example, replacing half-width katakana with full-width katakana and replacing full-width spaces with half-width spaces. | See here for details on the normalization contents. |
| URLRemover | Removes URLs matching the regular expression "(https?\|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+\$,%#]+)". | |
| SpecialCharactersRemover | Removes special characters within a specific Unicode range, for example ☀, ♡, ☆, and so on. | Please refer to this URL. The special characters removed include some emoji. |
| EmojiRemover | Removes emoji characters in the range \U0001F300 (🌀) to \U0001F9FF (🧿). | |
| QuotesRemover | Removes quote markers, for example [1], {245}, and so on. | |
| LengthFilter | Removes sentences that are too long or too short. | |
| LanguageFilter | Classifies the language and quality of each sentence using fastText. This function is applied to Japanese and English. | This function uses fastText. For details on fastText, please refer here. |
| MinhashDeduplicater | Deduplicates sentences using MinHash. SentencePiece is used to tokenize sentences. | |
| ZeroPunctuationFilter | Removes sentences that contain none of the punctuation marks "、","、","。","。",".",".","?","?","!","!". | |
| PerplexityFilter | Quality filtering based on KenLM perplexity. | |
| SentenceSegmenter | Performs rule-based sentence separation (sentence segmentation) using ja_sentence_segmenter. | Please refer to this article. |
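As an illustration of the URLRemover row above, here is a minimal sketch of removing URLs with the regular expression from the table, using `std::regex`. The function name `RemoveUrls` is an assumption, not necessarily the tool's actual API.

```cpp
#include <regex>
#include <string>

// Minimal sketch of the URLRemover step: strip substrings matching the URL
// regular expression shown in the feature table.
std::string RemoveUrls(const std::string& sentence) {
    // The same pattern as in the table, written as a C++ raw string literal
    // (the escaped "\/" forms are unnecessary inside a raw string).
    static const std::regex url_pattern(
        R"((https?|ftp)(://[-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#]+))");
    return std::regex_replace(sentence, url_pattern, "");
}
```

For example, `RemoveUrls("see https://example.com/page for details")` returns `"see  for details"` (with a double space where the URL was removed).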
In main.cc, we create 8 processes (a fixed value), and each process handles a different file to speed up the cleaning.
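The multi-process scheme above can be sketched with POSIX `fork`/`waitpid` as follows. This is an illustration only, not the actual main.cc: `ProcessFolder` is a hypothetical stand-in for the real per-file cleaning work, and `RunWorkers` is an assumed name.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the real cleaning work done by one process.
void ProcessFolder(int index) {
    std::printf("worker %d cleaning dataset/%d/input\n", index, index);
}

// Spawn num_workers child processes, one per numbered folder, then wait
// for all of them. Returns the number of children successfully reaped.
int RunWorkers(int num_workers) {
    std::vector<pid_t> children;
    for (int i = 0; i < num_workers; ++i) {
        pid_t pid = fork();
        if (pid == 0) {            // child: do the work for folder i, then exit
            ProcessFolder(i);
            _exit(0);
        }
        children.push_back(pid);
    }
    int reaped = 0;
    for (pid_t pid : children) {   // parent: wait for every worker
        if (waitpid(pid, nullptr, 0) == pid) ++reaped;
    }
    return reaped;
}
```

Using processes rather than threads keeps each worker's memory fully isolated, which suits embarrassingly parallel per-file cleaning.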
In this repository, tests are written with GoogleTest. The test folder is tests/, and the test file is CorpusCleaner_test.cpp.
To run the tests, please execute the following command.
When you run the command, the test results are printed.