Overview
Welcome to my repository!
This repository is a C++ library that provides quality filtering, deduplication, and unnecessary-vocabulary removal for Japanese corpora.
Its features are the following.
- Normalizer: Sentence normalization based on the mecab-neologd normalization rules
- URL Remover: Remove URLs that match a regular expression
- Special Characters Remover: Remove certain special characters (☀, ♡, ☆, etc.)
- Emoji Remover: Remove emoji characters in the range U+1F300 to U+1F9FF (see the sketch after this list)
- Quotes Remover: Remove quote markers such as [1] and {245}
- Length Filter: Remove sentences that are too long or too short
- Language Filter: Determine whether a document is Japanese
- MinHash Deduplicator: Deduplication using MinHash
- Zero Punctuation Filter: Delete documents without punctuation
- Sentence Segmenter: Divide the corpus into sentences based on rules
- Perplexity Filter: Perplexity filtering using KenLM
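As a rough illustration of the Emoji Remover's range check, here is a minimal sketch (an assumption-level example, not the library's actual implementation) that tests whether a UTF-32 code point falls in the targeted block:

```cpp
// Minimal sketch, not the library's actual code: return true if a UTF-32
// code point lies in the emoji block U+1F300..U+1F9FF that the
// Emoji Remover targets.
bool IsTargetEmoji(char32_t codepoint) {
    return codepoint >= 0x1F300 && codepoint <= 0x1F9FF;
}
```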
Getting Started
Clone Repository
git clone https://github.com/ce-lery/corpus-cleaner.git
cd corpus-cleaner
Install Step
Docker
Build a Python environment using the Dockerfile.
docker build -t corpus-cleaner-image ./
docker run -v ./:/home/corpus-cleaner/ -it --gpus all corpus-cleaner-image
Other (Local Install)
sudo apt upgrade
sudo apt-get install cmake gdb libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev pkg-config libgoogle-perftools-dev curl wget build-essential nano flex bison
Common Step
Run the build shell script.
This script installs the third-party libraries and builds the corpus-cleaner source code.
Please place the files to be cleaned in "./results/data/original". The file format is ".txt". For example, "wiki.txt", "cc100_train.txt", and so on.
mkdir -p results/data/original/
# Please place the files to be cleaned in "./results/data/original".
Run corpus_cleaner. This step may take a while.
The cleaned result files will be created in "results/data/cleaned".
The file format is jsonl.
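For illustration only, one line of a cleaned jsonl file might look like the following. Apart from is_removed (which appears in the TODO list below), the field names here are assumptions, not this tool's documented schema.

```json
{"text": "クリーニング済みの文です。", "is_removed": false}
```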
Specification
For this tool's specification and API reference, please refer here.
Usage
Basic Usage
The basic usage of corpus-cleaner is the same as described in Getting Started.
Select Filtering Feature
If you want to disable the Sentence Segmenter, set the constructor argument execute_sentence_segment to false when creating an instance of the CorpusCleaner class.

```cpp
// Disable sentence segmentation. The leading arguments before
// output_folder_path are assumed from context; match them against the
// actual CorpusCleaner constructor signature.
bool execute_sentence_segment = false;

CorpusCleaner corpus_cleaner(input_folder_path,
                             output_folder_path,
                             min_length,
                             max_length,
                             accept_language,
                             store_rejected,
                             execute_sentence_segment,
                             language_threshold,
                             perplexity_threshold,
                             &generate_dedup_lsh,
                             &deduplicator);
```
If you want to disable a filter, comment out the corresponding filter function in the variable TODO:.

```cpp
{
    // &CorpusCleaner::SpecialCharacterRemover,
    // &CorpusCleaner::QuotesRemover,
    // &CorpusCleaner::LengthFilter,
    // &CorpusCleaner::ZeroPunctuationFilter,
};
}
```
void Normalizer(Document &document)
Normalize sentences using the mecab-neologd normalization rules.
void PerplexityFilter(Document &document)
Quality filtering based on KenLM perplexity.
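For background, this is the standard definition of perplexity, not repository-specific code: for a sentence of N tokens w_1, ..., w_N under a language model P,

PPL = exp( -(1/N) * Σ_{i=1}^{N} log P(w_i | w_1, ..., w_{i-1}) )

Higher perplexity means the model finds the text less natural, so documents whose perplexity exceeds perplexity_threshold are presumably the ones rejected.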
void MinhashDeduplication(Document &document)
Deduplicate the files in the this->intermediate folder using MinHashLSH.
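For background, the standard MinHash property (not repository-specific code): for a random hash function h, the probability that two documents' minimum hash values collide equals their Jaccard similarity, Pr[min h(A) = min h(B)] = |A ∩ B| / |A ∪ B|. Comparing compact MinHash signatures therefore approximates set overlap and flags near-duplicate documents.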
int32_t CleanPipeline(void)
Pipeline that sequentially executes the configured CorpusCleaner methods.
void LanguageFilter(Document &document)
Language filtering using fastText.
void EmojiRemover(Document &document)
Remove emoji. For example, 🤗, 🐉, 📊, and so on.
void URLRemover(Document &document)
Remove URLs that match a regular expression.
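Putting several of the methods above together, here is a hypothetical usage sketch. The method names are those listed above; the Document construction and its text field are assumptions.

```cpp
// Hypothetical sketch: Document's fields and construction are assumptions;
// only the method names come from the API list above.
Document document;
document.text = "こんにちは🤗 https://example.com";

corpus_cleaner.Normalizer(document);    // neologd-style normalization
corpus_cleaner.URLRemover(document);    // remove the URL
corpus_cleaner.EmojiRemover(document);  // remove the emoji

// Or run all configured steps over the input folder in one call:
corpus_cleaner.CleanPipeline();
```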
Structure for storing statistical information for each process of CorpusCleaner.
This step may change in the future.
License
This repository is licensed under the Apache License, Version 2.0.
Please refer to the [LICENSE](LICENSE) file.
Third Party Library
This repository uses the following third-party libraries.
Please note their licenses.
| Library | License | Purpose |
| --- | --- | --- |
| icu | Unicode License v3 | For NFKC normalization in the Normalizer. |
| kenlm | LGPL license | For perplexity filtering. Since the tool is not embedded in this repository (it is installed at use time), I think this repository is not covered by the LGPL. |
| SentencePiece | Apache-2.0 license | For tokenization in perplexity filtering. |
| smhasher | MIT license | For hash value generation in MinHash processing. |
| simdjson | Apache-2.0 license | For jsonl parsing. |
| fastText | MIT license | For language filtering. |
| GoogleTest | BSD-3-Clause license | For tests. |
| doxygen | GPL-2.0 license | For documentation. This license does not apply to works produced by doxygen. |
Test
Contribution
We welcome your contributions to this repository. To contribute, please see CONTRIBUTING.md.
TODO
ver.0.1.0
- [ ] Write document & create doxygen
ver.0.2.0
- [ ] Set up GitHub Actions CI/CD (build)
- [ ] Implement pybind & Python code
- [ ] Implement a loader for json & jsonl
- [ ] Morphological analysis (by jagger)
- [ ] Remove ad headers and footers
- [ ] Remove HTML markup
- [ ] Implement dumping to .txt format files (only is_removed=false)
- [ ] Remove repeated expressions
ver.0.3.0