Corpus Cleaner
#include <minhash.hpp>
Public Member Functions
GenerateDedupLSH (uint32_t n_gram, uint32_t n_minhash, uint32_t n_buckets, uint32_t bucket_size)
~GenerateDedupLSH ()
vector< wstring > NGramTokenize (wstring text, int32_t n)
    Tokenize a string into n-gram tokens.
uint64_t GetMinhash (vector< wstring > *tokens, uint32_t seed)
    Calculate the minhash of a token list.
vector< string > CalculateLSH (wstring text)
    Calculate the minhash list of a text.
Definition at line 6 of file minhash.hpp.
GenerateDedupLSH::GenerateDedupLSH (uint32_t n_gram = 5, uint32_t n_minhash = 200, uint32_t n_buckets = 20, uint32_t bucket_size = 10)
Definition at line 6 of file minhash.cpp.
GenerateDedupLSH::~GenerateDedupLSH ()
Definition at line 19 of file minhash.cpp.
vector< string > GenerateDedupLSH::CalculateLSH (wstring text)
Calculate the minhash list of a text.
Generates a sequence of hash values from the text for use in deduplication.
If two documents share at least one hash value, the documents are considered duplicates.
Returns a list of hash values, each in the format '0+07ad0b7b163f434643387f3f4799a2d466bccd0c'.
The leading characters (here '0+') give the index of the hash within the list, not part of the hash value itself.
Because of this prefix, the hashes from every position can be pooled into a single hash table for duplicate detection.
Example:
text (wstring): input sentence
Definition at line 107 of file minhash.cpp.
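The key format and the pooled lookup described above can be sketched as follows. This is a minimal illustration and not the code in minhash.cpp: `BucketKeysSketch` and `IsDuplicateSketch` are hypothetical names, the FNV-1a-style mixing stands in for whatever digest the library actually computes, and the grouping of `bucket_size` consecutive minhash values per bucket is an assumption suggested by the constructor defaults (200 = 20 × 10).

```cpp
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: turn a flat list of minhash values into one key per
// bucket, formatted as "<bucket index>+<hex digest>".
// Assumption: FNV-1a-style mixing replaces the library's real digest.
std::vector<std::string> BucketKeysSketch(const std::vector<uint64_t>& minhashes,
                                          std::size_t n_buckets,
                                          std::size_t bucket_size) {
    std::vector<std::string> keys;
    if (minhashes.size() < n_buckets * bucket_size) return keys;  // not enough values
    for (std::size_t b = 0; b < n_buckets; ++b) {
        uint64_t digest = 14695981039346656037ULL;  // FNV offset basis
        for (std::size_t i = 0; i < bucket_size; ++i) {
            digest ^= minhashes[b * bucket_size + i];
            digest *= 1099511628211ULL;  // FNV prime
        }
        std::ostringstream key;
        key << b << '+' << std::hex << std::setw(16) << std::setfill('0') << digest;
        keys.push_back(key.str());
    }
    return keys;
}

// Hypothetical helper: returns true if any key is already in the pool;
// otherwise records every key and returns false. The index prefix keeps keys
// from different buckets apart, so one shared set is enough.
bool IsDuplicateSketch(const std::vector<std::string>& keys,
                       std::unordered_set<std::string>& seen) {
    for (const auto& key : keys) {
        if (seen.count(key) != 0) return true;
    }
    seen.insert(keys.begin(), keys.end());
    return false;
}
```

With this layout, two documents whose minhash values agree across one whole bucket produce the same key for that bucket, and the second document is flagged when it is checked against the pool.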
uint64_t GenerateDedupLSH::GetMinhash (vector< wstring > *tokens, uint32_t seed)
Calculate the minhash of a token list.
Example:
tokens (vector<wstring> *): token list
seed (uint32_t): the seed for the MurmurHash3 calculation
Definition at line 75 of file minhash.cpp.
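The idea behind a seeded minhash can be sketched as below: hash every token with the same seed and keep the smallest value. This is only an illustration. `SeededHashSketch` and `GetMinhashSketch` are hypothetical names, and the FNV-1a-style hash is a stand-in chosen to keep the example self-contained; the documented function uses MurmurHash3 and takes a pointer rather than a reference.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical stand-in for the seeded hash: FNV-1a-style mixing over the
// token's code units, with the seed folded into the starting state.
uint64_t SeededHashSketch(const std::wstring& token, uint32_t seed) {
    uint64_t h = 14695981039346656037ULL ^ seed;  // FNV offset basis mixed with seed
    for (wchar_t c : token) {
        h ^= static_cast<uint64_t>(static_cast<uint32_t>(c));
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

// Minhash of a token list: the minimum seeded hash over all tokens.
// Assumption: an empty list yields the maximum uint64_t value.
uint64_t GetMinhashSketch(const std::vector<std::wstring>& tokens, uint32_t seed) {
    uint64_t min_hash = std::numeric_limits<uint64_t>::max();
    for (const auto& token : tokens) {
        const uint64_t h = SeededHashSketch(token, seed);
        if (h < min_hash) min_hash = h;
    }
    return min_hash;
}
```

Because only the minimum is kept, the result does not depend on token order, and calling the function with different seeds gives the independent hash functions that a minhash signature needs.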
vector< wstring > GenerateDedupLSH::NGramTokenize (wstring text, int32_t n)
Tokenize a string into n-gram tokens.
Example:
text (wstring): input text
n (int32_t): the n-gram size
Definition at line 36 of file minhash.cpp.
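A minimal sketch of n-gram tokenization, assuming character-level n-grams taken with a sliding window of width n. `NGramTokenizeSketch` is a hypothetical name, and the handling of texts shorter than n is an assumption, not confirmed behavior of the library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for NGramTokenize: every run of n consecutive
// characters becomes one token.
std::vector<std::wstring> NGramTokenizeSketch(const std::wstring& text, int32_t n) {
    std::vector<std::wstring> tokens;
    if (n <= 0) return tokens;
    const std::size_t width = static_cast<std::size_t>(n);
    // Assumption: a non-empty text shorter than n is returned as a single token.
    if (text.size() < width) {
        if (!text.empty()) tokens.push_back(text);
        return tokens;
    }
    tokens.reserve(text.size() - width + 1);
    for (std::size_t i = 0; i + width <= text.size(); ++i) {
        tokens.push_back(text.substr(i, width));
    }
    return tokens;
}
```

For the input L"abcde" with n = 3, this sketch yields the tokens L"abc", L"bcd", and L"cde".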