Corpus Cleaner
#include <bits/stdc++.h>
#include "language_filter.hpp"
#include "perplexity_filter.hh"
#include "minhash.hpp"
Classes
| struct | _DOCUMENT |
| Structure for storing statistical information for each process of CorpusCleaner. | |
| struct | _STATS |
| Structure for storing statistical information for each process of CorpusCleaner. | |
| class | CorpusCleaner |
Typedefs
| typedef struct _DOCUMENT | Document |
| Structure for storing statistical information for each process of CorpusCleaner. | |
| typedef struct _STATS | Stats |
| Structure for storing statistical information for each process of CorpusCleaner. | |
Functions
| void | ConvertTextToDocument (string sentence, string filename, string file_line_count, Document &document) |
| Convert a sentence to a Document. | |
| void | ConvertInputFilesToJsonl (string input_folder_path, string output_folder_path) |
| Convert input files to jsonl files whose lines hold Document's elements. | |
| void | ReadDocumentFromJsonlOneLine (Document &document, string input_jsonl_line) |
| Read a Document from one line of a jsonl file. | |
| void | WriteDocumentToJsonl (Document &document, string output_file_path) |
| Log Document to output_file_path. | |
| Stats | MakeStats (string process_name, string output_path, double elapsed_time) |
| Format statistics. | |
| void | OutputStats (Stats stats) |
| Output statistics. | |
Structure for storing statistical information for each process of CorpusCleaner.
Each process of CorpusCleaner obtains the following specific information.
Structure for storing statistical information for each process of CorpusCleaner.
Each process of CorpusCleaner obtains the following statistical information.
These values are used later, for example when plotting the results.
| void ConvertInputFilesToJsonl | ( | const string | input_folder_path, |
| const string | output_folder_path ) |
Convert input files to jsonl files whose lines hold Document's elements.
| string | input_folder_path: Path of the folder that contains the input files. |
| string | output_folder_path: Path of the folder where the jsonl files are written. |
Definition at line 135 of file corpus_cleaner.cpp.
| void ConvertTextToDocument | ( | string | sentence, |
| string | filename, | ||
| string | file_line_count, | ||
| Document & | document ) |
Convert a sentence to a Document.
| string | sentence: sentence |
| string | filename: filename without file extension |
| string | file_line_count: the line number of sentence in "filename" |
| Document | document: document converted |
Definition at line 113 of file corpus_cleaner.cpp.
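A minimal sketch of what this conversion might look like, for illustration only. The members of the real `Document` are not listed on this page, so `DocumentSketch`, its `text` and `id` fields, and the underscore-joined id are assumptions rather than the library's actual layout.

```cpp
#include <string>

// Hypothetical stand-in for Document; field names are assumptions.
struct DocumentSketch {
    std::string text;  // the sentence itself
    std::string id;    // assumed form: "<filename>_<file_line_count>"
};

// Fills `document` from one sentence and the place it came from.
void ConvertTextToDocumentSketch(const std::string& sentence,
                                 const std::string& filename,
                                 const std::string& file_line_count,
                                 DocumentSketch& document) {
    document.text = sentence;
    document.id = filename + "_" + file_line_count;
}
```

Deriving the id from the file name and line number would let a cleaned sentence be traced back to its source line, which is presumably why both are passed in.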
| Stats MakeStats | ( | string | process_name, |
| string | output_path, | ||
| double | elapsed_time ) |
Format statistics.
Example:
| string | process_name: Cleaning filter name. |
| string | output_path: Path of file for statistics. |
| double | elapsed_time: elapsed process time. |
Definition at line 186 of file corpus_cleaner.cpp.
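The sketch below shows one way the three arguments could be bundled, together with a helper for measuring `elapsed_time`. It is an assumption-laden illustration: `StatsSketch` and its fields are hypothetical (the real `Stats` may hold more), `SecondsSince` is not part of CorpusCleaner, and the unit of seconds is assumed.

```cpp
#include <chrono>
#include <string>

// Hypothetical stand-in for Stats; the real struct may hold more fields.
struct StatsSketch {
    std::string process_name;  // cleaning filter name
    std::string output_path;   // path of the file for statistics
    double elapsed_time;       // assumed to be in seconds
};

// Bundles the three arguments into one statistics record.
StatsSketch MakeStatsSketch(const std::string& process_name,
                            const std::string& output_path,
                            double elapsed_time) {
    StatsSketch stats;
    stats.process_name = process_name;
    stats.output_path = output_path;
    stats.elapsed_time = elapsed_time;
    return stats;
}

// Hypothetical helper: seconds elapsed since `start`.
double SecondsSince(std::chrono::steady_clock::time_point start) {
    const auto now = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(now - start).count();
}
```

A caller would record `std::chrono::steady_clock::now()` before running a filter and pass `SecondsSince(start)` as `elapsed_time` afterwards.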
| void OutputStats | ( | Stats | stats | ) |
Output statistics.
Example:
| Stats | stats: Statistics to be output. |
Definition at line 210 of file corpus_cleaner.cpp.
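Printing the statistics could be as simple as the following sketch. The `StatsSketch` fields and the label text are assumptions made for illustration; the actual output format of `OutputStats` is not shown on this page.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for Stats; field names are assumptions.
struct StatsSketch {
    std::string process_name;
    std::string output_path;
    double elapsed_time;  // assumed to be in seconds
};

// Renders the statistics as one labelled line per field.
std::string FormatStatsSketch(const StatsSketch& stats) {
    std::ostringstream os;
    os << "process_name: " << stats.process_name << '\n'
       << "output_path: " << stats.output_path << '\n'
       << "elapsed_time[s]: " << stats.elapsed_time << '\n';
    return os.str();
}

// Writes the formatted statistics to standard output.
void OutputStatsSketch(const StatsSketch& stats) {
    std::cout << FormatStatsSketch(stats);
}
```

Separating formatting from printing keeps the text easy to check in a test and easy to redirect to a log file.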
| void ReadDocumentFromJsonlOneLine | ( | Document & | document, |
| string | input_jsonl_line ) |
Read a Document from one line of a jsonl file.
Example:
| Document | document: document that receives the parsed values |
| string | input_jsonl_line: one line of the input jsonl file |
Definition at line 26 of file corpus_cleaner.cpp.
| void WriteDocumentToJsonl | ( | Document & | document, |
| string | output_file_path ) |
Log Document to output_file_path.
| Document | document: document |
| string | output_file_path: Path of the jsonl file the document is written to. |
Definition at line 74 of file corpus_cleaner.cpp.
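Writing one document per line requires escaping the characters that JSON does not allow raw inside a string. The sketch below is a hedged illustration of that step: `DocumentSketch`, its two fields, the helper names, and the choice to append to the file are assumptions, not the library's implementation.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical stand-in for Document; field names are assumptions.
struct DocumentSketch {
    std::string text;
    std::string id;
};

// Escapes quotes, backslashes and control characters for a JSON string.
// Bytes >= 0x80 (for example UTF-8 sequences) are passed through unchanged.
std::string EscapeJson(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    char buf[7];  // "\uXXXX" plus terminating NUL
                    std::snprintf(buf, sizeof(buf), "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Serializes one document as a single JSON object without a trailing newline.
std::string DocumentToJsonLine(const DocumentSketch& d) {
    return "{\"text\":\"" + EscapeJson(d.text) +
           "\",\"id\":\"" + EscapeJson(d.id) + "\"}";
}

// Appends the document as one line to the file at `output_file_path`.
void WriteDocumentToJsonlSketch(const DocumentSketch& d,
                                const std::string& output_file_path) {
    std::ofstream out(output_file_path, std::ios::app);
    out << DocumentToJsonLine(d) << '\n';
}
```

Opening the file in append mode means repeated calls accumulate one record per line, which matches the jsonl layout.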