Corpus Cleaner
corpus_cleaner.cpp File Reference
#include "corpus_cleaner.hpp"
#include "util.hpp"
#include "normalizer.hpp"
#include "simdjson.h"

Functions

void ReadDocumentFromJsonlOneLine (Document &document, string input_jsonl_line)
 Read one JSONL line into a Document.
 
void WriteDocumentToJsonl (Document &document, string output_file_path)
 Write a Document to the JSONL file at output_file_path.
 
void ConvertTextToDocument (string sentence, string filename, string file_line_count, Document &document)
 Convert a sentence into a Document.
 
void ConvertInputFilesToJsonl (const string input_folder_path, const string output_folder_path)
 Convert input files to JSONL records that hold the Document's elements.
 
Stats MakeStats (string process_name, string output_path, double elapsed_time)
 Format statistics.
 
void OutputStats (Stats stats)
 Output statistics.
 

Function Documentation

◆ ConvertInputFilesToJsonl()

void ConvertInputFilesToJsonl ( const string input_folder_path,
const string output_folder_path )

Convert input files to JSONL records that hold the Document's elements.

Parameters
string input_folder_path: Path of the folder that contains the input files.
string output_folder_path: Path of the folder where the JSONL output is written.
Returns
void: None

Definition at line 135 of file corpus_cleaner.cpp.

◆ ConvertTextToDocument()

void ConvertTextToDocument ( string sentence,
string filename,
string file_line_count,
Document & document )

Convert a sentence into a Document.

Parameters
string sentence: Sentence to convert.
string filename: File name without the file extension.
string file_line_count: Line number of the sentence in "filename".
Document document: Document that receives the converted result.
Returns
void: None

Definition at line 113 of file corpus_cleaner.cpp.
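
As a rough illustration of the conversion described above, the sketch below fills a stand-in document from a sentence. The SketchDocument struct, its text and id fields, and the "filename_linecount" id scheme are assumptions made for this example; the actual Document type is declared in corpus_cleaner.hpp and may differ.

```cpp
#include <string>

// Hypothetical stand-in for the library's Document type (assumption).
struct SketchDocument {
    std::string text;  // the sentence itself
    std::string id;    // assumed id scheme: "<filename>_<file_line_count>"
};

// Sketch of a ConvertTextToDocument-style helper (not the library's code).
void SketchConvertTextToDocument(const std::string& sentence,
                                 const std::string& filename,
                                 const std::string& file_line_count,
                                 SketchDocument& document) {
    document.text = sentence;
    document.id = filename + "_" + file_line_count;
}
```

Passing the document by reference mirrors the documented signature, where the caller owns the Document and the function only fills it in.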

◆ MakeStats()

Stats MakeStats ( string process_name,
string output_path,
double elapsed_time )

Format statistics.

Parameters
string process_name: Name of the cleaning filter.
string output_path: Path of the file for statistics.
double elapsed_time: Elapsed processing time.
Returns
Stats: statistics

Definition at line 186 of file corpus_cleaner.cpp.
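
One way a caller might obtain elapsed_time and bundle it with the other arguments is sketched below. SketchStats, SketchMakeStats, and MeasureSeconds are hypothetical names introduced for this example, and measuring in seconds is an assumption; the real Stats structure may hold different or additional fields.

```cpp
#include <chrono>
#include <string>

// Hypothetical stand-in for the library's Stats type (assumption).
struct SketchStats {
    std::string process_name;
    std::string output_path;
    double elapsed_time;  // seconds, by assumption
};

// Sketch of a MakeStats-style helper: bundle the arguments into one record.
SketchStats SketchMakeStats(const std::string& process_name,
                            const std::string& output_path,
                            double elapsed_time) {
    return SketchStats{process_name, output_path, elapsed_time};
}

// Time a callable with a monotonic clock and return the elapsed seconds.
template <typename Callable>
double MeasureSeconds(Callable&& process) {
    const auto start = std::chrono::steady_clock::now();
    process();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

A caller could then write SketchMakeStats("ExampleFilter", "./stats.jsonl", MeasureSeconds(run_filter)), where run_filter is any callable. std::chrono::steady_clock is used because it never jumps backwards, unlike the wall clock.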

◆ OutputStats()

void OutputStats ( Stats stats)

Output statistics.

Parameters
Stats stats: Statistics to be output.
Returns
void: None

Definition at line 210 of file corpus_cleaner.cpp.

◆ ReadDocumentFromJsonlOneLine()

void ReadDocumentFromJsonlOneLine ( Document & document,
string input_jsonl_line )

Read one JSONL line into a Document.

Example:

    string input_file_path = "./input.jsonl";
    ifstream ifs(input_file_path);
    Document document;
    string line = "";
    while (getline(ifs, line)) {
        ReadDocumentFromJsonlOneLine(document, line);
        // Process the document here.
    }

Parameters
Document document: Document that receives the parsed fields.
string input_jsonl_line: One line of JSONL input.
Returns
void: None

Definition at line 26 of file corpus_cleaner.cpp.

◆ WriteDocumentToJsonl()

void WriteDocumentToJsonl ( Document & document,
string output_file_path )

Write a Document to the JSONL file at output_file_path.

Parameters
Document document: Document to write.
string output_file_path: Path of the output JSONL file.
Returns
void: None

Definition at line 74 of file corpus_cleaner.cpp.
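
Writing a record as one JSONL line mainly requires escaping the string fields so that the record stays on a single line. The sketch below shows that step using only the standard library. The "text" and "id" keys, the helper names, and the choice to append rather than truncate are assumptions for illustration, not the behaviour of WriteDocumentToJsonl itself.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Escape a UTF-8 string so it can sit inside a JSON string literal.
std::string EscapeJsonString(const std::string& input) {
    std::string out;
    for (unsigned char c : input) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters need the \u00XX form.
                    char buf[7];
                    std::snprintf(buf, sizeof(buf), "\\u%04x",
                                  static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Build one JSONL record; the "text" and "id" keys are assumed field names.
std::string MakeJsonlLine(const std::string& text, const std::string& id) {
    return "{\"text\":\"" + EscapeJsonString(text) +
           "\",\"id\":\"" + EscapeJsonString(id) + "\"}";
}

// Append the record as a single line to the file at output_file_path.
bool AppendJsonlLine(const std::string& text, const std::string& id,
                     const std::string& output_file_path) {
    std::ofstream ofs(output_file_path, std::ios::app);
    if (!ofs) return false;
    ofs << MakeJsonlLine(text, id) << '\n';
    return static_cast<bool>(ofs);
}
```

Escaping newlines is what keeps the one-record-per-line property intact, so a reader that splits the file with getline sees exactly one record per call.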