Set Similarity Search Beyond Minhash
2016 Β· Tobias Christiani, Rasmus Pagh
Abstract
We consider the problem of approximate set similarity search under Braun-Blanquet similarity \(B(\mathbf\{x\}, \mathbf\{y\}) = |\mathbf\{x\} \cap \mathbf\{y\}| / \max(|\mathbf\{x\}|, |\mathbf\{y\}|)\). The \((b_2, b_2)\)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets \(P\) such that, given a query set \(\mathbf\{q\}\), if there exists \(\mathbf\{x\} \in P\) with \(B(\mathbf\{q\}, \mathbf\{x\}) \geq b_1\), then we can efficiently return \(\mathbf\{x\}' \in P\) with \(B(\mathbf\{q\}, \mathbf\{x\}') > b_2\). We present a simple data structure that solves this problem with space usage \(O(n^\{1+\rho\}log n + \sum_\{\mathbf\{x\} \in P\}|\mathbf\{x\}|)\) and query time \(O(|\mathbf\{q\}|n^\{\rho\} log n)\) where \(n = |P|\) and \(\rho = log(1/b_1)/log(1/b_2)\). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of \(\rho\) is tight across the parameter space, i.e., for every
Authors
(none)
Tags
Stats
Related papers
- Subsets And Supermajorities: Optimal Hashing-based Set Similarity Search (2019)5.84
- Optimal Las Vegas Locality Sensitive Data Structures (2017)6.77
- Optimal Hashing-based Time-space Trade-offs For Approximate Near Neighbors (2016)11.29
- Fast Similarity Sketching (2017)9.41
- Efficient Similarity Search In Dynamic Data Streams (2016)0.00
- Locality Sensitive Hashing For Set-queries, Motivated By Group Recommendations (2020)0.00
- Efficient Bitmap-based Indexing And Retrieval Of Similarity Search Image Queries (2019)0.00
- Robust Set Reconciliation Via Locality Sensitive Hashing (2018)2.26