Overview
Welcome to my repository!
This repository is a C++ library that provides quality filtering, deduplication, and unnecessary-vocabulary removal for Japanese corpora.
Its features are the following.
- Normalizer: Sentence normalization based on the mecab-neologd normalization rules
- URL Remover: Remove URLs that match a regular expression
- Special Characters Remover: Remove certain special characters (☀, ♡, ☆, etc.)
- Emoji Remover: Remove emoji characters in the range U+1F300 to U+1F9FF (see the sketch after this list)
- Quotes Remover: Remove quote markers such as [1] and {245}
- Length Filter: Remove sentences that are too long or too short
- Language Filter: Determine whether a document is Japanese
- MinHash Deduplicator: Deduplication using MinHash
- Zero Punctuation Filter: Delete documents without punctuation
- Sentence Segmenter: Divide the corpus into sentences based on rules
- Perplexity Filter: Perplexity filtering using KenLM
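As a rough illustration of the Emoji Remover's range check, here is a minimal sketch (an assumption-level example, not the library's actual implementation) that tests whether a UTF-32 code point falls in the targeted block:

```cpp
// Minimal sketch, not the library's actual code: return true if a UTF-32
// code point lies in the emoji block U+1F300..U+1F9FF that the
// Emoji Remover targets.
bool IsTargetEmoji(char32_t codepoint) {
    return codepoint >= 0x1F300 && codepoint <= 0x1F9FF;
}
```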
Getting Started
Clone Repository
git clone https://github.com/ce-lery/corpus-cleaner.git
cd corpus-cleaner
Install Step
Docker
Build a Python environment using the Dockerfile.
docker build -t corpus-cleaner-image ./
docker run -v ./:/home/corpus-cleaner/ -it --gpus all corpus-cleaner-image
Other (Local Install)
sudo apt upgrade
sudo apt-get install cmake gdb libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev pkg-config libgoogle-perftools-dev curl wget build-essential nano flex bison
Common Step
Run the build shell script.
This script installs the third-party libraries and builds the corpus-cleaner source code.
Please place the files to be cleaned in "./results/data/original". The file format is ".txt". For example, "wiki.txt", "cc100_train.txt", and so on.
mkdir -p results/data/original/
# Please place the files to be cleaned in "./results/data/original".
Run corpus_cleaner. This step may take a while.
The cleaned result files will be created in "results/data/cleaned".
The file format is jsonl.
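For illustration only, one line of a cleaned jsonl file might look like the following. Apart from is_removed (which appears in the TODO list below), the field names here are assumptions, not this tool's documented schema.

```json
{"text": "クリーニング済みの文です。", "is_removed": false}
```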
Specification
For this tool's specification and API reference, please refer here.
Usage
Basic Usage
The basic usage of corpus-cleaner is the same as described in Getting Started.
Select Filtering Feature
If you want to disable the Sentence Segmenter, set the constructor argument execute_sentence_segment to false when creating an instance of the CorpusCleaner class.

```cpp
// Disable sentence segmentation. The leading arguments before
// output_folder_path are assumed from context; match them against the
// actual CorpusCleaner constructor signature.
bool execute_sentence_segment = false;

CorpusCleaner corpus_cleaner(input_folder_path,
                             output_folder_path,
                             min_length,
                             max_length,
                             accept_language,
                             store_rejected,
                             execute_sentence_segment,
                             language_threshold,
                             perplexity_threshold,
                             &generate_dedup_lsh,
                             &deduplicator);
```
If you want to disable a filter, comment out the corresponding filter function in the variable TODO:.

```cpp
{
    // &CorpusCleaner::SpecialCharacterRemover,
    // &CorpusCleaner::QuotesRemover,
    // &CorpusCleaner::LengthFilter,
    // &CorpusCleaner::ZeroPunctuationFilter,
};
}
```
void Normalizer(Document &document)
Normalize sentences using the mecab-neologd normalization rules.
void PerplexityFilter(Document &document)
Quality filtering based on KenLM perplexity.
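For background, this is the standard definition of perplexity, not repository-specific code: for a sentence of N tokens w_1, ..., w_N under a language model P,

PPL = exp( -(1/N) * Σ_{i=1}^{N} log P(w_i | w_1, ..., w_{i-1}) )

Higher perplexity means the model finds the text less natural, so documents whose perplexity exceeds perplexity_threshold are presumably the ones rejected.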
void MinhashDeduplication(Document &document)
Deduplicate the files in the this->intermediate folder using MinHashLSH.
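For background, the standard MinHash property (not repository-specific code): for a random hash function h, the probability that two documents' minimum hash values collide equals their Jaccard similarity, Pr[min h(A) = min h(B)] = |A ∩ B| / |A ∪ B|. Comparing compact MinHash signatures therefore approximates set overlap and flags near-duplicate documents.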
int32_t CleanPipeline(void)
Pipeline that sequentially executes the configured CorpusCleaner methods.
void LanguageFilter(Document &document)
Language filtering using fastText.
void EmojiRemover(Document &document)
Remove emoji. For example, 🤗, 🐉, 📊, and so on.
void URLRemover(Document &document)
Remove URLs that match a regular expression.
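Putting several of the methods above together, here is a hypothetical usage sketch. The method names are those listed above; the Document construction and its text field are assumptions.

```cpp
// Hypothetical sketch: Document's fields and construction are assumptions;
// only the method names come from the API list above.
Document document;
document.text = "こんにちは🤗 https://example.com";

corpus_cleaner.Normalizer(document);    // neologd-style normalization
corpus_cleaner.URLRemover(document);    // remove the URL
corpus_cleaner.EmojiRemover(document);  // remove the emoji

// Or run all configured steps over the input folder in one call:
corpus_cleaner.CleanPipeline();
```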
Structure for storing statistical information for each process of CorpusCleaner.
This step may change in the future.
License
This repository is licensed under the Apache License, Version 2.0.
Please refer to the [LICENSE](LICENSE) file.
Third Party Library
This repository uses the following third-party libraries.
Please note their licenses.
| Library | License | Purpose |
| --- | --- | --- |
| icu | Unicode License v3 | For NFKC normalization in the Normalizer. |
| kenlm | LGPL license | For perplexity filtering. Since the tool is not embedded in this repository (it is installed at use time), I think this repository is not covered by the LGPL. |
| SentencePiece | Apache-2.0 license | For tokenization in perplexity filtering. |
| smhasher | MIT license | For hash value generation in MinHash processing. |
| simdjson | Apache-2.0 license | For jsonl parsing. |
| fastText | MIT license | For language filtering. |
| GoogleTest | BSD-3-Clause license | For tests. |
| doxygen | GPL-2.0 license | For documentation. This license does not apply to works produced by doxygen. |
Test
Contribution
We welcome your contributions to this repository. To contribute, please see CONTRIBUTING.md.
TODO
ver.0.1.0
- [ ] Write document & create doxygen
ver.0.2.0
- [ ] Set up GitHub Actions CI/CD (build)
- [ ] Implement pybind & Python code
- [ ] Implement a loader for json & jsonl
- [ ] Morphological analysis (by jagger)
- [ ] Remove ad headers and footers
- [ ] Remove HTML markup
- [ ] Implement dumping to .txt format files (only is_removed=false)
- [ ] Remove repeated expressions
ver.0.3.0