Probminhash -- A Class Of Locality-sensitive Hash Algorithms For The (probability) Jaccard Similarity
2019 Β· Otmar Ertl
Abstract
The probability Jaccard similarity was recently proposed as a natural generalization of the Jaccard similarity to measure the proximity of sets whose elements are associated with relative frequencies or probabilities. In combination with a hash algorithm that maps those weighted sets to compact signatures which allow fast estimation of pairwise similarities, it constitutes a valuable method for big data applications such as near-duplicate detection, nearest neighbor search, or clustering. This paper introduces a class of one-pass locality-sensitive hash algorithms that are orders of magnitude faster than the original approach. The performance gain is achieved by calculating signature components not independently, but collectively. Four different algorithms are proposed based on this idea. Two of them are statistically equivalent to the original approach and can be used as drop-in replacements. The other two may even improve the estimation error by introducing statistical dependence bet
Authors
(none)
Tags
Stats
Related papers
- Superminhash - A New Minwise Hashing Algorithm For Jaccard Similarity Estimation (2017)0.00
- Maximally Consistent Sampling And The Jaccard Index Of Probability Distributions (2018)0.00
- Practical Hash Functions For Similarity Estimation And Dimensionality Reduction (2017)0.00
- Differentially Private One Permutation Hashing And Bin-wise Consistent Weighted Sampling (2023)0.00
- Subsets And Supermajorities: Optimal Hashing-based Set Similarity Search (2019)5.84
- Analysis Of Sparsehash: An Efficient Embedding Of Set-similarity Via Sparse Projections (2019)4.52
- Optimal Las Vegas Locality Sensitive Data Structures (2017)6.77
- A Memory-efficient Sketch Method For Estimating High Similarities In Streaming Sets (2019)12.02