Maximally Consistent Sampling And The Jaccard Index Of Probability Distributions
2018 Β· Ryan Moulton, Yunjiang Jiang
Abstract
We introduce simple, efficient algorithms for computing a MinHash of a probability distribution, suitable for both sparse and dense data, with equivalent running times to the state of the art for both cases. The collision probability of these algorithms is a new measure of the similarity of positive vectors which we investigate in detail. We describe the sense in which this collision probability is optimal for any Locality Sensitive Hash based on sampling. We argue that this similarity measure is more useful for probability distributions than the similarity pursued by other algorithms for weighted MinHash, and is the natural generalization of the Jaccard index.
Authors
(none)
Tags
Stats
Related papers
- Probminhash -- A Class Of Locality-sensitive Hash Algorithms For The (probability) Jaccard Similarity (2019)9.92
- Superminhash - A New Minwise Hashing Algorithm For Jaccard Similarity Estimation (2017)0.00
- Differentially Private One Permutation Hashing And Bin-wise Consistent Weighted Sampling (2023)0.00
- Analysis Of Sparsehash: An Efficient Embedding Of Set-similarity Via Sparse Projections (2019)4.52
- Efficient Similarity Search In Dynamic Data Streams (2016)0.00
- A Memory-efficient Sketch Method For Estimating High Similarities In Streaming Sets (2019)12.02
- Optimal Las Vegas Locality Sensitive Data Structures (2017)6.77
- Forestdsh: A Universal Hash Design For Discrete Probability Distributions (2019)0.00