Abstract

We consider the \(\textit\{Similarity Sketching\}\) problem: Given a universe \([u] = \\{0,\ldots, u-1\\}\) we want a random function \(S\) mapping subsets \(A\subseteq [u]\) into vectors \(S(A)\) of size \(t\), such that the Jaccard similarity \(J(A,B) = |A\cap B|/|A\cup B|\) between sets \(A\) and \(B\) is preserved. More precisely, define \(X_i = [S(A)[i] = S(B)[i]]\) and \(X = \sum_\{i\in [t]\} X_i\). We want \(E[X_i]=J(A,B)\), and we want \(X\) to be strongly concentrated around \(E[X] = t \cdot J(A,B)\) (i.e. Chernoff-style bounds). This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors \(S(A)\) are also called \(\textit\{sketches\}\). Strong concentration is critical, for often we want to sketch many sets \(B_1,\ldots,B_n\) so that we later, for a query set \(A\), can find (one of) the most similar \(B_i\). It is then critical that no

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations17
  • S2 citations
  • github stars0
  • HF likes0
  • heat score9.41
  • arxiv keydahlgaard2017fast

Related papers