Corpus Cleaner
KenLMFilter Class Reference

#include <perplexity_filter.hh>

Public Member Functions

 KenLMFilter ()
 Construct a KenLMFilter.
 
double Score (const wstring sentence)
 Score a sentence with KenLM.
 
double ScoreWithSentencePiece (const wstring sentence)
 Score a sentence with KenLM after SentencePiece tokenization.
 
double Perplexity (const wstring sentence)
 Compute the perplexity of a sentence with KenLM.
 
double PerplexityWithSentencePiece (const wstring sentence)
 Compute the perplexity of a sentence with KenLM after SentencePiece tokenization.
 

Public Attributes

sentencepiece::SentencePieceProcessor processor
 

Detailed Description

Definition at line 33 of file perplexity_filter.hh.

Constructor & Destructor Documentation

◆ KenLMFilter()

KenLMFilter::KenLMFilter ( )

Construct a KenLMFilter.

Definition at line 27 of file perplexity_filter.cc.

Member Function Documentation

◆ Perplexity()

double KenLMFilter::Perplexity ( const wstring sentence)

Compute the perplexity of a sentence with KenLM.

Steps:

  1. Split the sentence into single characters.
  2. Calculate the perplexity value.

Example:

wstring sentence = L"吾輩は猫である.名前はまだない.";
cout << Perplexity(sentence) << endl;
// 4117.1
Parameters
const wstring sentence: the sentence to evaluate
Returns
double: perplexity computed by KenLM
See also
https://github.com/kpu/kenlm/blob/master/python/kenlm.pyx#L209
https://zenn.dev/syoyo/articles/529ce949121ca4
https://github.com/facebookresearch/cc_net

Definition at line 147 of file perplexity_filter.cc.
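
The first reference linked above, KenLM's Python binding, derives perplexity from the total log10 score as 10^(-score / (N + 1)), where N is the number of tokens and the extra 1 accounts for the end-of-sentence token. The sketch below shows that conversion only. Whether Perplexity() applies exactly this formula is an assumption, and the helper name PerplexityFromLog10Score is hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper, not part of KenLMFilter.
// Converts a total log10 probability into a perplexity using the formula
// from KenLM's Python binding: 10^(-score / (tokens + 1)).
// The "+ 1" counts the end-of-sentence token </s>.
double PerplexityFromLog10Score(double log10_score, std::size_t token_count) {
  const double words = static_cast<double>(token_count) + 1.0;
  return std::pow(10.0, -log10_score / words);
}
```

For instance, a total log10 score of -6.0 over 2 tokens gives 10^(6 / 3) = 100. A lower (more negative) score or a shorter sentence raises the perplexity.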

◆ PerplexityWithSentencePiece()

double KenLMFilter::PerplexityWithSentencePiece ( const wstring sentence)

Compute the perplexity of a sentence with KenLM after SentencePiece tokenization.

Steps:

  1. Split the sentence into tokens with SentencePiece.
  2. Calculate the perplexity value.

Example:

wstring sentence = L"吾輩は猫である.名前はまだない.";
cout << PerplexityWithSentencePiece(sentence) << endl;
// 677.5
Parameters
const wstring sentence: the sentence to evaluate
Returns
double: perplexity computed by KenLM
See also
https://github.com/kpu/kenlm/blob/master/python/kenlm.pyx#L209
https://zenn.dev/syoyo/articles/529ce949121ca4
https://github.com/facebookresearch/cc_net

Definition at line 176 of file perplexity_filter.cc.
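
The two steps of PerplexityWithSentencePiece() can be sketched as a small pipeline: tokenize first, then turn the language-model score of the tokens into a perplexity. The sketch below is not the library's implementation. The Tokenizer and Log10Scorer callbacks are hypothetical stand-ins for SentencePiece and KenLM, the function name is invented for illustration, and the 10^(-score / (N + 1)) formula is assumed from KenLM's Python binding.

```cpp
#include <cmath>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the SentencePiece tokenizer and the KenLM scorer.
using Tokenizer = std::function<std::vector<std::string>(const std::string&)>;
using Log10Scorer = std::function<double(const std::vector<std::string>&)>;

// Step 1: split the sentence into tokens.
// Step 2: convert the total log10 score of those tokens into a perplexity
//         with 10^(-score / (tokens + 1)); the "+ 1" counts </s>.
double PerplexityWithTokenizer(const std::string& sentence,
                               const Tokenizer& tokenize,
                               const Log10Scorer& score) {
  const std::vector<std::string> tokens = tokenize(sentence);
  const double words = static_cast<double>(tokens.size()) + 1.0;
  return std::pow(10.0, -score(tokens) / words);
}
```

Because the token count appears in the exponent, the same sentence can yield a different perplexity under character-level and SentencePiece tokenization, which is consistent with the two example values in this page differing.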

◆ Score()

double KenLMFilter::Score ( const wstring sentence)

Score a sentence with KenLM.

Steps:

  1. Split the sentence into single characters.

Example:

wstring sentence = L"吾輩は猫である.名前はまだない.";
cout << Score(sentence) << endl;
// -60.5849
Parameters
const wstring &sentence: the sentence to score
Returns
double: score computed by KenLM
See also
https://github.com/google/sentencepiece/blob/master/doc/api.md
https://github.com/google/sentencepiece

Definition at line 57 of file perplexity_filter.cc.
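
The character-splitting step of Score() can be illustrated with a short sketch. It assumes the characters are handed to the language model as whitespace-separated tokens, which this page does not confirm, and the helper name SplitIntoCharacters is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of KenLMFilter.
// Inserts one space between consecutive characters so that every character
// becomes its own whitespace-delimited token.
// Assumes each character fits in a single wchar_t; with a 16-bit wchar_t,
// characters outside the Basic Multilingual Plane would be split in two.
std::wstring SplitIntoCharacters(const std::wstring& sentence) {
  std::wstring result;
  result.reserve(sentence.size() * 2);
  for (std::size_t i = 0; i < sentence.size(); ++i) {
    if (i > 0) result.push_back(L' ');
    result.push_back(sentence[i]);
  }
  return result;
}
```

Character-level tokens avoid the need for a word segmenter on Japanese text, at the cost of longer token sequences than a subword tokenizer would produce.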

◆ ScoreWithSentencePiece()

double KenLMFilter::ScoreWithSentencePiece ( const wstring sentence)

Score a sentence with KenLM after SentencePiece tokenization.

Steps:

  1. Split the sentence into tokens with SentencePiece.

Example:

wstring sentence = L"吾輩は猫である.名前はまだない.";
cout << ScoreWithSentencePiece(sentence) << endl;
//
Parameters
const wstring &sentence: the sentence to score
Returns
double: score computed by KenLM
See also
https://github.com/google/sentencepiece/blob/master/doc/api.md
https://github.com/google/sentencepiece

Definition at line 102 of file perplexity_filter.cc.

Member Data Documentation

◆ processor

sentencepiece::SentencePieceProcessor KenLMFilter::processor

Definition at line 36 of file perplexity_filter.hh.


The documentation for this class was generated from the following files: