Corpus Cleaner
#include <minhash.hpp>
Public Member Functions | |
| LSHDeduplicator (bool onlin_dedupe, string blacklist_path, bool store_blacklist, size_t total_backet_size_mb) | |
| ~LSHDeduplicator () | |
| bool | Apply (const vector< string > *lshs) |
| Determine whether a document is a duplicate based on its LSH values. | |
| size_t | SizeOfSeen (void) |
| Calculate size of seen (rough estimate) | |
| size_t | SizeOfBlacklist (void) |
| Calculate size of blacklist (rough estimate) | |
| void | InitializeSeen (void) |
| Initialize seen parameter. | |
| void | StoreBlacklist (void) |
| Save Blacklist to file. | |
| void | LoadBlacklistToSeen (void) |
| Read Blacklist from file. | |
| size_t | GetTotalBucketSize (void) |
| Get the total bucket size. | |
| void | InitializeBlacklist (void) |
| Initialize blacklist parameter. | |
Definition at line 28 of file minhash.hpp.
| LSHDeduplicator::LSHDeduplicator | ( | bool | onlin_dedupe = true, |
| string | blacklist_path = "", | ||
| bool | store_blacklist = false, | ||
| size_t | total_backet_size_mb = 5120 ) |
Definition at line 140 of file minhash.cpp.
| LSHDeduplicator::~LSHDeduplicator | ( | ) |
Definition at line 155 of file minhash.cpp.
| bool LSHDeduplicator::Apply | ( | const vector< string > * | lshs | ) |
Determine whether a document is a duplicate based on its LSH values.
Duplication is determined based on the hash values generated by GenerateDedupLSH.
If the target corpus is approximately 10^6 or smaller (a few tens of GB is a rough guide),
deduplication is possible without any prior processing.
To deduplicate without preprocessing (online), set the online_dedup flag to True.
The default value of online_dedup is True.
For larger corpora, it becomes difficult to hold the hash values of all documents in memory.
Duplicate documents are assumed to make up only a few percent of all documents, so
by reading only the duplicate hash values from a file as a blacklist, it is possible to process a corpus of several hundred GB.
Duplicate hash values are read from the file specified by blacklist_path. The blacklist file is assumed to record, on every other line, a hash value generated by GenerateDedupLSH.
When the store_blacklist flag is set to True,
duplicate hash values are recorded as a set of strings in the LSHDeduplicator.blacklist attribute.
This option is useful, for example, when creating a blacklist hash value file.
The default value of the store_blacklist flag is False.
Example:
| vector<string> | *lshs: lshs list generated by GenerateDedupLSH |
Definition at line 200 of file minhash.cpp.
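The behaviour described for Apply can be illustrated with a minimal sketch. It is not the code in minhash.cpp: the class SketchDeduplicator, its LoadSeen helper, and the choice of std::unordered_set are assumptions made for illustration, and the real LSHDeduplicator may differ in how it stores and checks hash values.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for LSHDeduplicator; names and internals are assumed.
class SketchDeduplicator {
 public:
  explicit SketchDeduplicator(bool online_dedup = true,
                              bool store_blacklist = false)
      : online_dedup_(online_dedup), store_blacklist_(store_blacklist) {}

  // Pre-populate seen with known duplicate hashes (the blacklist use case).
  void LoadSeen(const std::vector<std::string>& hashes) {
    seen_.insert(hashes.begin(), hashes.end());
  }

  // Returns true if any LSH value of the document has been seen before.
  bool Apply(const std::vector<std::string>& lshs) {
    bool is_duplicate = false;
    for (const std::string& lsh : lshs) {
      if (seen_.count(lsh) > 0) {
        is_duplicate = true;
        // Record the colliding hash so a blacklist file can be built later.
        if (store_blacklist_) blacklist_.insert(lsh);
      }
      // Online mode: remember every hash so later documents can match it.
      if (online_dedup_) seen_.insert(lsh);
    }
    return is_duplicate;
  }

  const std::unordered_set<std::string>& blacklist() const {
    return blacklist_;
  }

 private:
  bool online_dedup_;
  bool store_blacklist_;
  std::unordered_set<std::string> seen_;
  std::unordered_set<std::string> blacklist_;
};
```

In this sketch, online mode grows seen with every document, which is why memory becomes the limit for large corpora. With online_dedup set to false, seen holds only the preloaded blacklist hashes, so memory use stays proportional to the number of duplicates.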
| size_t LSHDeduplicator::GetTotalBucketSize | ( | void | ) |
Get the total bucket size.
| void | None |
Definition at line 381 of file minhash.cpp.
| void LSHDeduplicator::InitializeBlacklist | ( | void | ) |
Initialize blacklist parameter.
Example:
| void | None |
Definition at line 331 of file minhash.cpp.
| void LSHDeduplicator::InitializeSeen | ( | void | ) |
Initialize seen parameter.
Calculate size of seen. The steps are as follows. Example: GenerateDedupLSH generate_dedupe_lsh(3);
| void | None |
Definition at line 300 of file minhash.cpp.
| void LSHDeduplicator::LoadBlacklistToSeen | ( | void | ) |
Read Blacklist from file.
| void | None |
Definition at line 361 of file minhash.cpp.
| size_t LSHDeduplicator::SizeOfBlacklist | ( | void | ) |
Calculate size of blacklist (rough estimate)
| void | None |
Definition at line 265 of file minhash.cpp.
| size_t LSHDeduplicator::SizeOfSeen | ( | void | ) |
Calculate size of seen (rough estimate)
| void | None |
Definition at line 232 of file minhash.cpp.
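The following sketch shows one way a rough size estimate for a set of hash strings could be computed and compared against a budget given in megabytes. It is an illustration only: the functions RoughSizeInBytes and ExceedsBudget are hypothetical, and this page does not state how SizeOfSeen computes its estimate or how it relates to total_backet_size_mb.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Rough estimate: one std::string object plus its characters per element.
// Hash-table buckets and allocator overhead are deliberately ignored.
std::size_t RoughSizeInBytes(const std::unordered_set<std::string>& hashes) {
  std::size_t total = 0;
  for (const std::string& h : hashes) {
    total += sizeof(std::string) + h.size();
  }
  return total;
}

// Assumed policy: compare the estimate against a budget in megabytes.
bool ExceedsBudget(const std::unordered_set<std::string>& hashes,
                   std::size_t budget_mb) {
  return RoughSizeInBytes(hashes) > budget_mb * 1024 * 1024;
}
```

A check of this kind could be used to decide when to reset seen. The estimate undercounts real memory use, so a conservative budget is advisable.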
| void LSHDeduplicator::StoreBlacklist | ( | void | ) |
Save Blacklist to file.
| void | None |
Definition at line 344 of file minhash.cpp.
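As an illustration of the blacklist round trip (StoreBlacklist followed by LoadBlacklistToSeen), here is a minimal sketch. It assumes a plain-text file with one hash value per non-empty line. The functions StoreHashes and LoadHashes are hypothetical, and the actual on-disk format used by minhash.cpp may differ.

```cpp
#include <fstream>
#include <string>
#include <unordered_set>

// Write each hash on its own line. Returns false if the file cannot be written.
bool StoreHashes(const std::unordered_set<std::string>& hashes,
                 const std::string& path) {
  std::ofstream out(path);
  if (!out) return false;
  for (const std::string& h : hashes) {
    out << h << '\n';
  }
  out.flush();
  return static_cast<bool>(out);
}

// Read hashes into seen, skipping blank lines. Returns false if the file
// cannot be opened.
bool LoadHashes(const std::string& path,
                std::unordered_set<std::string>* seen) {
  std::ifstream in(path);
  if (!in) return false;
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty()) seen->insert(line);
  }
  return true;
}
```

Skipping blank lines keeps the loader tolerant of files in which hash values are separated by empty lines.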