Fast Similarity Sketching
2017 · Søren Dahlgaard, Mathias Bæk Tejs Langhede, Jakob Bæk Tejs Houen, et al.
Abstract
We consider the \(\textit\{Similarity Sketching\}\) problem: Given a universe \([u] = \\{0,\ldots, u-1\\}\) we want a random function \(S\) mapping subsets \(A\subseteq [u]\) into vectors \(S(A)\) of size \(t\), such that the Jaccard similarity \(J(A,B) = |A\cap B|/|A\cup B|\) between sets \(A\) and \(B\) is preserved. More precisely, define \(X_i = [S(A)[i] = S(B)[i]]\) and \(X = \sum_\{i\in [t]\} X_i\). We want \(E[X_i]=J(A,B)\), and we want \(X\) to be strongly concentrated around \(E[X] = t \cdot J(A,B)\) (i.e. Chernoff-style bounds). This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors \(S(A)\) are also called \(\textit\{sketches\}\). Strong concentration is critical, for often we want to sketch many sets \(B_1,\ldots,B_n\) so that we later, for a query set \(A\), can find (one of) the most similar \(B_i\). It is then critical that no
Authors
(none)
Tags
Stats
Related papers
- Simisketch: Efficiently Estimating Similarity Of Streaming Multisets (2024)0.00
- A Memory-efficient Sketch Method For Estimating High Similarities In Streaming Sets (2019)12.02
- Subsets And Supermajorities: Optimal Hashing-based Set Similarity Search (2019)5.84
- Set Similarity Search Beyond Minhash (2016)10.74
- Efficient Similarity Search In Dynamic Data Streams (2016)0.00
- Streaming Binary Sketching Based On Subspace Tracking And Diagonal Uniformization (2017)0.00
- Weighted Minwise Hashing Beats Linear Sketching For Inner Product Estimation (2023)3.58
- Higher-order Count Sketch: Dimensionality Reduction That Retains Efficient Tensor Operations (2019)4.52