Corpus Cleaner
corpus_cleaner.hpp File Reference
#include <bits/stdc++.h>
#include "language_filter.hpp"
#include "perplexity_filter.hh"
#include "minhash.hpp"


Classes

struct  _DOCUMENT
 Structure for storing the per-text information that each CorpusCleaner process reads and updates.
 
struct  _STATS
 Structure for storing statistical information for each process of CorpusCleaner.
 
class  CorpusCleaner
 

Typedefs

typedef struct _DOCUMENT Document
 Structure for storing the per-text information that each CorpusCleaner process reads and updates.
 
typedef struct _STATS Stats
 Structure for storing statistical information for each process of CorpusCleaner.
 

Functions

void ConvertTextToDocument (string sentence, string filename, string file_line_count, Document &document)
 Convert one sentence of an input file into a Document.
 
void ConvertInputFilesToJsonl (string input_folder_path, string output_folder_path)
 Convert input files to JSONL whose lines hold the elements of Document.
 
void ReadDocumentFromJsonlOneLine (Document &document, string input_jsonl_line)
 Read one line of a JSONL file into a Document.
 
void WriteDocumentToJsonl (Document &document, string output_file_path)
 Log a Document to output_file_path.
 
Stats MakeStats (string process_name, string output_path, double elapsed_time)
 Format statistics.
 
void OutputStats (Stats stats)
 Output statistics.
 

Typedef Documentation

◆ Document

typedef struct _DOCUMENT Document

Structure for storing the per-text information that each CorpusCleaner process reads and updates.

Each process of CorpusCleaner obtains the following specific information.

  • text: one sentence of the corpus
  • id: text identifier
  • is_rejected: true if this text is eligible for deletion
  • metadata: tags added during the filtering process
  • language: language determined by LanguageFilter
  • language_score: language score calculated by LanguageFilter
  • perplexity: perplexity calculated by PerplexityFilter

These values are kept for later use, for example for plotting.
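
The field list above can be pictured as a plain struct. The following is a minimal sketch only: the field names come from the list, but the types, the default values, the name DocumentSketch, and the helper Reject are assumptions made for illustration and are not taken from corpus_cleaner.hpp.

```cpp
#include <set>
#include <string>

// Hypothetical layout: field names follow the list above, types are assumed.
struct DocumentSketch {
    std::string text;                // one sentence of the corpus
    std::string id;                  // text identifier
    bool is_rejected = false;        // true if the text is eligible for deletion
    std::set<std::string> metadata;  // tags added during the filtering process
    std::string language;            // language determined by LanguageFilter
    double language_score = 0.0;     // score calculated by LanguageFilter
    double perplexity = 0.0;         // perplexity calculated by PerplexityFilter
};

// Hypothetical helper: mark a document as rejected and record, as a metadata
// tag, the name of the filter that rejected it.
inline void Reject(DocumentSketch &document, const std::string &filter_name) {
    document.is_rejected = true;
    document.metadata.insert(filter_name);
}
```

A filter written against this sketch would leave text untouched and only flip is_rejected and add a tag, so that later stages can still inspect why a text was dropped.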

◆ Stats

typedef struct _STATS Stats

Structure for storing statistical information for each process of CorpusCleaner.

Each process of CorpusCleaner obtains the following statistical information.

  1. CorpusCleaner processing name
  2. Processed file name
  3. Elapsed processing time
  4. File size after processing

These values are kept for later use, for example for plotting.


Function Documentation

◆ ConvertInputFilesToJsonl()

void ConvertInputFilesToJsonl ( const string input_folder_path,
const string output_folder_path )

Convert input files to JSONL whose lines hold the elements of Document.

Parameters
string input_folder_path: Path of the folder that contains the input files.
string output_folder_path: Path of the folder that receives the JSONL output.
Returns
void: None

Definition at line 135 of file corpus_cleaner.cpp.
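
The folder traversal that such a conversion needs can be sketched with std::filesystem. This is an illustration under stated assumptions, not the code at the line referenced above: ForEachInputLine is a hypothetical helper, only files directly inside the folder are visited, and line numbers are assumed to start at 1.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>

// Hypothetical helper, not part of corpus_cleaner.hpp.
// Visits every line of every regular file directly under input_folder_path and
// passes (file name without extension, line number as text, line) to on_line.
inline void ForEachInputLine(
    const std::string &input_folder_path,
    const std::function<void(const std::string &, const std::string &,
                             const std::string &)> &on_line) {
    namespace fs = std::filesystem;
    for (const auto &entry : fs::directory_iterator(input_folder_path)) {
        if (!entry.is_regular_file()) continue;  // skip subfolders and the like
        const std::string stem = entry.path().stem().string();
        std::ifstream ifs(entry.path());
        std::string line;
        std::size_t line_count = 0;  // assumption: numbering starts at 1
        while (std::getline(ifs, line)) {
            ++line_count;
            on_line(stem, std::to_string(line_count), line);
        }
    }
}
```

The three callback arguments line up with the sentence, filename, and file_line_count parameters of ConvertTextToDocument, so a caller could convert each line and write it out from inside the callback.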

◆ ConvertTextToDocument()

void ConvertTextToDocument ( string sentence,
string filename,
string file_line_count,
Document & document )

Convert one sentence of an input file into a Document.

Parameters
string sentence: Sentence to convert.
string filename: File name without the file extension.
string file_line_count: Line number of the sentence in "filename".
Document document: Document that receives the converted sentence.
Returns
void: None

Definition at line 113 of file corpus_cleaner.cpp.
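
As a rough illustration of what this conversion involves, the sketch below fills a reduced document from the three string arguments. The struct TextDocument, the function name ConvertTextToDocumentSketch, and in particular the id format "<filename>_<line>" are assumptions; the source does not state how the id is built.

```cpp
#include <string>

// Hypothetical subset of the Document fields, for illustration only.
struct TextDocument {
    std::string text;
    std::string id;
    bool is_rejected = false;
};

// Sketch of the conversion. Assumption: the id joins the file name and the
// line number with an underscore so that it is unique within the corpus.
inline void ConvertTextToDocumentSketch(const std::string &sentence,
                                        const std::string &filename,
                                        const std::string &file_line_count,
                                        TextDocument &document) {
    document.text = sentence;
    document.id = filename + "_" + file_line_count;
    document.is_rejected = false;  // a freshly converted text is not rejected
}
```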

◆ MakeStats()

Stats MakeStats ( string process_name,
string output_path,
double elapsed_time )

Format statistics.

Parameters
string process_name: Cleaning filter name.
string output_path: Path of file for statistics.
double elapsed_time: Elapsed process time.
Returns
Stats: statistics
Definition at line 186 of file corpus_cleaner.cpp.

◆ OutputStats()

void OutputStats ( Stats stats)

Output statistics.

Parameters
Stats stats: Statistics to be output.
Returns
void: None

Definition at line 210 of file corpus_cleaner.cpp.

◆ ReadDocumentFromJsonlOneLine()

void ReadDocumentFromJsonlOneLine ( Document & document,
string input_jsonl_line )

Read one line of a JSONL file into a Document.

Example:

    string input_file_path = "./input.jsonl";
    ifstream ifs(input_file_path);
    Document document;
    string line = "";
    // Parse each JSONL line into the same Document instance.
    while(getline(ifs,line)){
        ReadDocumentFromJsonlOneLine(document,line);
        // Process the document here.
    }

Parameters
Document document: Document that receives the parsed values.
string input_jsonl_line: One line read from a JSONL file.
Returns
void: None

Definition at line 26 of file corpus_cleaner.cpp.

◆ WriteDocumentToJsonl()

void WriteDocumentToJsonl ( Document & document,
string output_file_path )

Log a Document to output_file_path.

Parameters
Document document: Document to write.
string output_file_path: Path of the JSONL file that receives the document.
Returns
void: None

Definition at line 74 of file corpus_cleaner.cpp.