A Memory-efficient Sketch Method For Estimating High Similarities In Streaming Sets
2019 Β· Pinghui Wang, Yiyan Qi, Yuanming Zhang, et al.
Abstract
Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been successfully used for many applications such as similarity search and large scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash can be applied to static sets as well as streaming sets, of which elements are given in a streaming fashion and cardinality is unknown or even infinite, unfortunately, b-bit MinHash and Odd Sketch fail to deal with streaming data. To solve this problem, we design a memory efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller sized registers
Authors
(none)
Tags
Stats
Related papers
- Simisketch: Efficiently Estimating Similarity Of Streaming Multisets (2024)0.00
- Efficient Similarity Search In Dynamic Data Streams (2016)0.00
- Fast Similarity Sketching (2017)9.41
- Superminhash - A New Minwise Hashing Algorithm For Jaccard Similarity Estimation (2017)0.00
- Subsets And Supermajorities: Optimal Hashing-based Set Similarity Search (2019)5.84
- Analysis Of Sparsehash: An Efficient Embedding Of Set-similarity Via Sparse Projections (2019)4.52
- Weighted Minwise Hashing Beats Linear Sketching For Inner Product Estimation (2023)3.58
- Sub-linear Memory Sketches For Near Neighbor Search On Streaming Data (2019)0.00