Corpus Cleaner
Functions | |
| void | ReadDocumentFromJsonlOneLine (Document &document, string input_jsonl_line) |
| Read a Document from one line of JSONL input. | |
| void | WriteDocumentToJsonl (Document &document, string output_file_path) |
| Write Document to output_file_path in JSONL format. | |
| void | ConvertTextToDocument (string sentence, string filename, string file_line_count, Document &document) |
| Convert a sentence into a Document. | |
| void | ConvertInputFilesToJsonl (const string input_folder_path, const string output_folder_path) |
| Convert input files to JSONL whose records hold Document's elements. | |
| Stats | MakeStats (string process_name, string output_path, double elapsed_time) |
| Format statistics. | |
| void | OutputStats (Stats stats) |
| Output statistics. | |
| void ConvertInputFilesToJsonl | ( | const string | input_folder_path, |
| const string | output_folder_path ) |
Convert input files to JSONL whose records hold Document's elements.
| const string | input_folder_path: Path of the folder containing the input files. |
| const string | output_folder_path: Path of the folder for the output JSONL files. |
Definition at line 135 of file corpus_cleaner.cpp.
| void ConvertTextToDocument | ( | string | sentence, |
| string | filename, | ||
| string | file_line_count, | ||
| Document & | document ) |
Convert a sentence into a Document.
| string | sentence: sentence |
| string | filename: filename without file extension |
| string | file_line_count: the line number of sentence in "filename" |
| Document | document: document converted |
Definition at line 113 of file corpus_cleaner.cpp.
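The conversion described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `SketchDocument`, its two fields, and the `"<filename>_<line number>"` id scheme are assumptions, since the real `Document` layout is not shown on this page.

```cpp
#include <string>

// Hypothetical stand-in for the library's Document type.
// The real fields are not listed on this page.
struct SketchDocument {
    std::string text;  // the sentence itself
    std::string id;    // assumed identifier: "<filename>_<line number>"
};

// Assumed behavior: store the sentence and derive an id from the
// source filename and the line number of the sentence in that file.
void SketchConvertTextToDocument(const std::string& sentence,
                                 const std::string& filename,
                                 const std::string& file_line_count,
                                 SketchDocument& document) {
    document.text = sentence;
    document.id = filename + "_" + file_line_count;
}
```

Passing the line number as a string, as the documented signature does, lets the caller concatenate it directly without a numeric conversion.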
| Stats MakeStats | ( | string | process_name, |
| string | output_path, | ||
| double | elapsed_time ) |
Format statistics.
Example:
| string | process_name: Cleaning filter name. |
| string | output_path: Path of file for statistics. |
| double | elapsed_time: elapsed process time. |
Definition at line 186 of file corpus_cleaner.cpp.
| void OutputStats | ( | Stats | stats | ) |
Output statistics.
Example:
| Stats | stats: Statistics to be output. |
Definition at line 210 of file corpus_cleaner.cpp.
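A possible shape for the `MakeStats` and `OutputStats` pair is sketched below. The `SketchStats` struct, its field names, and the printed layout are assumptions made for illustration; the page only documents the three inputs and that the statistics are output.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for the library's Stats type.
struct SketchStats {
    std::string process_name;  // cleaning filter name
    std::string output_path;   // path of the file for statistics
    double elapsed_time;       // elapsed process time
};

// Bundle the three documented inputs into one value.
SketchStats SketchMakeStats(std::string process_name,
                            std::string output_path,
                            double elapsed_time) {
    SketchStats stats;
    stats.process_name = std::move(process_name);
    stats.output_path = std::move(output_path);
    stats.elapsed_time = elapsed_time;
    return stats;
}

// Print the statistics, one field per line. Writing to a caller-supplied
// stream (defaulting to std::cout) is a choice made for this sketch.
void SketchOutputStats(const SketchStats& stats, std::ostream& out = std::cout) {
    out << "process: " << stats.process_name << '\n'
        << "output: " << stats.output_path << '\n'
        << "elapsed: " << stats.elapsed_time << '\n';
}
```

Separating construction from output, as the two documented functions do, allows the statistics to be collected per filter and reported later.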
| void ReadDocumentFromJsonlOneLine | ( | Document & | document, |
| string | input_jsonl_line ) |
Read a Document from one line of JSONL input.
Example:
| Document | document: document |
| string | input_jsonl_line: One line of JSONL input. |
Definition at line 26 of file corpus_cleaner.cpp.
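Reading one JSONL line into a document might look like the sketch below. It assumes a flat JSON object with string fields named `text` and `id`, which this page does not confirm, and it uses a hand-written extractor that handles only a few escapes. A production version would more likely rely on a real JSON parser.

```cpp
#include <string>

// Hypothetical stand-in for the library's Document type.
struct SketchDocument {
    std::string text;
    std::string id;
};

// Extract the string value stored under `key` in a flat, one-line JSON object.
// Limitations of this sketch: no whitespace before the colon, no nested
// objects, no \uXXXX escapes. Returns "" when the key is absent.
std::string ExtractJsonString(const std::string& line, const std::string& key) {
    const std::string needle = "\"" + key + "\":";
    std::size_t pos = line.find(needle);
    if (pos == std::string::npos) return "";
    pos = line.find('"', pos + needle.size());  // opening quote of the value
    if (pos == std::string::npos) return "";
    std::string value;
    for (std::size_t i = pos + 1; i < line.size(); ++i) {
        const char c = line[i];
        if (c == '"') break;  // closing quote of the value
        if (c == '\\' && i + 1 < line.size()) {
            const char next = line[++i];
            if (next == 'n') value += '\n';
            else if (next == 't') value += '\t';
            else value += next;  // escaped quote or escaped backslash
        } else {
            value += c;
        }
    }
    return value;
}

// Fill `document` from one JSONL line (field names are assumptions).
void SketchReadDocumentFromJsonlOneLine(SketchDocument& document,
                                        const std::string& input_jsonl_line) {
    document.text = ExtractJsonString(input_jsonl_line, "text");
    document.id = ExtractJsonString(input_jsonl_line, "id");
}
```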
| void WriteDocumentToJsonl | ( | Document & | document, |
| string | output_file_path ) |
Write Document to output_file_path in JSONL format.
| Document | document: document |
| string | output_file_path: Path of the output JSONL file. |
Definition at line 74 of file corpus_cleaner.cpp.
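Writing a document as one JSONL record could be sketched as below. The `SketchDocument` fields, the `text`/`id` key names, and the choice to append to the file are all assumptions; only the function's inputs are documented here. The escaping covers quotes, backslashes, newlines, and tabs, and leaves other control characters untouched.

```cpp
#include <fstream>
#include <string>

// Hypothetical stand-in for the library's Document type.
struct SketchDocument {
    std::string text;
    std::string id;
};

// Escape the characters that would otherwise break a one-line JSON string.
std::string EscapeJsonString(const std::string& s) {
    std::string out;
    for (const char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\t': out += "\\t";  break;
            default:   out += c;      break;
        }
    }
    return out;
}

// Serialize the document as a single-line JSON object.
std::string SketchDocumentToJsonLine(const SketchDocument& document) {
    return "{\"text\":\"" + EscapeJsonString(document.text) +
           "\",\"id\":\"" + EscapeJsonString(document.id) + "\"}";
}

// Append one record to the JSONL file (append mode is an assumption).
void SketchWriteDocumentToJsonl(SketchDocument& document,
                                const std::string& output_file_path) {
    std::ofstream out(output_file_path, std::ios::app);
    out << SketchDocumentToJsonLine(document) << '\n';
}
```

Escaping newlines matters for JSONL in particular, because a raw newline inside a field would split one record across two lines.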