Corpus Cleaner
LSHDeduplicator Class Reference

#include <minhash.hpp>

Public Member Functions

 LSHDeduplicator (bool onlin_dedupe, string blacklist_path, bool store_blacklist, size_t total_backet_size_mb)
 
 ~LSHDeduplicator ()
 
bool Apply (const vector< string > *lshs)
 Determine whether a document is a duplicate from its LSH values.
 
size_t SizeOfSeen (void)
 Calculate size of seen (rough estimate)
 
size_t SizeOfBlacklist (void)
 Calculate size of blacklist (rough estimate)
 
void InitializeSeen (void)
 Initialize seen parameter.
 
void StoreBlacklist (void)
 Save Blacklist to file.
 
void LoadBlacklistToSeen (void)
 Read Blacklist from file.
 
size_t GetTotalBucketSize (void)
 Get the total bucket size.
 
void InitializeBlacklist (void)
 Initialize blacklist parameter.
 

Detailed Description

Definition at line 28 of file minhash.hpp.

Constructor & Destructor Documentation

◆ LSHDeduplicator()

LSHDeduplicator::LSHDeduplicator ( bool onlin_dedupe = true,
string blacklist_path = "",
bool store_blacklist = false,
size_t total_backet_size_mb = 5120 )

Definition at line 140 of file minhash.cpp.

◆ ~LSHDeduplicator()

LSHDeduplicator::~LSHDeduplicator ( )

Definition at line 155 of file minhash.cpp.

Member Function Documentation

◆ Apply()

bool LSHDeduplicator::Apply ( const vector< string > * lshs)

Determine whether a document is a duplicate from its LSH values.

Duplication is determined from the hash values generated by GenerateDedupLSH.
If the target corpus is approximately 10^6 or smaller (several tens of GB as a rough guide),
deduplication is possible without prior processing.
To deduplicate without preprocessing (online), set the online_dedup flag (the onlin_dedupe constructor parameter) to true.
The default value of online_dedup is true.
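
The online mode described above can be sketched as follows. This is a minimal illustration, not the LSHDeduplicator implementation: the class name OnlineDedupSketch and its members are hypothetical, and it assumes a document counts as a duplicate as soon as any one of its LSH values has been seen before.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in used only for illustration.
// It keeps every LSH value it has encountered in memory.
class OnlineDedupSketch {
 public:
  // Returns true if at least one LSH value was already seen (duplicate).
  // Values that are new are remembered for later calls.
  bool Apply(const std::vector<std::string>* lshs) {
    bool is_duplicate = false;
    for (const std::string& lsh : *lshs) {
      if (seen_.count(lsh) > 0) {
        is_duplicate = true;
      } else {
        seen_.insert(lsh);
      }
    }
    return is_duplicate;
  }

  // Number of distinct LSH values currently held in memory.
  std::size_t SeenCount() const { return seen_.size(); }

 private:
  std::unordered_set<std::string> seen_;
};
```

Because every distinct value stays in the set, memory grows with the corpus, which is why this approach is only practical for smaller corpora.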

For larger corpora, it becomes difficult to hold the hash values of all documents in memory.
Because duplicate documents are expected to be only a few percent of all documents,
reading only their hash values from a file as a blacklist makes it possible to process a corpus of several hundred GB.

Duplicate hash values are read from the file specified by blacklist_path. The blacklist file is assumed to record the hash values generated by GenerateDedupLSH line by line.
When the store_blacklist flag is set to true,
duplicate hash values are recorded as a set of strings in the LSHDeduplicator.blacklist attribute.
This option is useful, for example, when creating a blacklist hash value file.
The default value of the store_blacklist flag is false.
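
The effect of store_blacklist can be sketched like this. The class BlacklistRecordingSketch is hypothetical and only mirrors the behaviour described above: when the flag is set, each hash value that collides with an already-seen value is also copied into a separate blacklist set, which can later be written out as a blacklist file.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical illustration of the store_blacklist behaviour.
class BlacklistRecordingSketch {
 public:
  explicit BlacklistRecordingSketch(bool store_blacklist)
      : store_blacklist_(store_blacklist) {}

  // Returns true if any LSH value was seen before.
  // When store_blacklist_ is true, the colliding values are recorded.
  bool Apply(const std::vector<std::string>& lshs) {
    bool is_duplicate = false;
    for (const std::string& lsh : lshs) {
      if (seen_.count(lsh) > 0) {
        is_duplicate = true;
        if (store_blacklist_) {
          blacklist_.insert(lsh);
        }
      } else {
        seen_.insert(lsh);
      }
    }
    return is_duplicate;
  }

  const std::unordered_set<std::string>& blacklist() const {
    return blacklist_;
  }

 private:
  bool store_blacklist_;
  std::unordered_set<std::string> seen_;
  std::unordered_set<std::string> blacklist_;
};
```

With the flag off, the blacklist set stays empty and only the duplicate verdict is produced.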

Example:

GenerateDedupLSH generate_dedup_lsh;
// CalculateLSH takes a wstring, so the literals are wide strings.
vector<string> d1 = generate_dedup_lsh.CalculateLSH(L"Hello, World.");
vector<string> d2 = generate_dedup_lsh.CalculateLSH(L"吾輩は猫である。名前はまだ無い。どこで生まれたかとんと見当がつかぬ。");
vector<string> d3 = generate_dedup_lsh.CalculateLSH(L"吾輩は鳥である。名前はまだ無い。どこで生まれたかとんと見当がつかぬ。");
LSHDeduplicator deduplicator;
// Apply takes a pointer to the LSH list.
cout << deduplicator.Apply(&d1) << endl;
//false
cout << deduplicator.Apply(&d2) << endl;
//false
cout << deduplicator.Apply(&d3) << endl;
//true
Parameters
lshs : list of LSH values generated by GenerateDedupLSH
Returns
bool : true if the document is a duplicate, false otherwise
Note
https://github.com/HojiChar/HojiChar/blob/v0.9.0/hojichar/filters/deduplication.py

Definition at line 200 of file minhash.cpp.

◆ GetTotalBucketSize()

size_t LSHDeduplicator::GetTotalBucketSize ( void )

Get the total bucket size.

Parameters
void : None
Returns
size_t : total bucket size

Definition at line 381 of file minhash.cpp.

◆ InitializeBlacklist()

void LSHDeduplicator::InitializeBlacklist ( void )

Initialize blacklist parameter.

Parameters
void : None
Returns
void : None

Definition at line 331 of file minhash.cpp.

◆ InitializeSeen()

void LSHDeduplicator::InitializeSeen ( void )

Initialize seen parameter.

Calculate size of seen. The steps are as follows.
Example:
GenerateDedupLSH generate_dedupe_lsh(3);

Parameters
void : None
Returns
void : None

Definition at line 300 of file minhash.cpp.

◆ LoadBlacklistToSeen()

void LSHDeduplicator::LoadBlacklistToSeen ( void )

Read Blacklist from file.

Parameters
void : None
Returns
void : None

Definition at line 361 of file minhash.cpp.

◆ SizeOfBlacklist()

size_t LSHDeduplicator::SizeOfBlacklist ( void )

Calculate size of blacklist (rough estimate)

Parameters
void : None
Returns
size_t : size of blacklist

Definition at line 265 of file minhash.cpp.

◆ SizeOfSeen()

size_t LSHDeduplicator::SizeOfSeen ( void )

Calculate size of seen (rough estimate)

Parameters
void : None
Returns
size_t : size of seen

Definition at line 232 of file minhash.cpp.
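
One way to obtain a rough size estimate for a set of hash strings is sketched below. The function RoughSizeInBytes is hypothetical and may differ from what SizeOfSeen and SizeOfBlacklist actually compute. It sums the string object size and the allocated capacity per element, and deliberately ignores the bucket array and per-node overhead of the container, so it is only an approximation.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical rough estimate of the memory used by the stored strings.
// Container overhead (buckets, nodes) is not counted.
std::size_t RoughSizeInBytes(const std::unordered_set<std::string>& values) {
  std::size_t total = 0;
  for (const std::string& value : values) {
    total += sizeof(std::string) + value.capacity();
  }
  return total;
}
```

An estimate of this kind can be compared against a memory budget such as the bucket size given to the constructor in order to decide when the set should be cleared.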

◆ StoreBlacklist()

void LSHDeduplicator::StoreBlacklist ( void )

Save Blacklist to file.

Parameters
void : None
Returns
void : None

Definition at line 344 of file minhash.cpp.


The documentation for this class was generated from the following files: