Subsets And Supermajorities: Optimal Hashing-based Set Similarity Search
2019 · Thomas Dybdahl Ahle, Jakob Bæk Tejs Knudsen
Abstract
We formulate and optimally solve a new generalized Set Similarity Search problem, which assumes the sizes of the database and query sets are known in advance. By creating polylog copies of our data structure, we optimally solve any symmetric Approximate Set Similarity Search problem, including approximate versions of Subset Search, Maximum Inner Product Search (MIPS), Jaccard Similarity Search and Partial Match. Our algorithm can be seen as a natural generalization of previous work on Set as well as Euclidean Similarity Search, but conceptually it differs by optimally exploiting the information present in the sets as well as their complements, and doing so asymmetrically between queries and stored sets. Doing so, we improve upon the best previous work: MinHash [J. Discrete Algorithms 1998], SimHash [STOC 2002], Spherical LSF [SODA 2016, 2017] and Chosen Path [STOC 2017] by as much as a factor \(n^{0.14}\) in both time and space; or in the near-constant time regime, in space, by an ar
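For orientation, the following is a minimal sketch of MinHash, one of the baselines the abstract names, used here to estimate Jaccard similarity. It is not the algorithm of this paper. The function names, the salted BLAKE2b hash, and the choice of 256 hash functions are illustrative assumptions.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _salted_hash(seed, item):
    # Assumption: a seed-salted BLAKE2b digest stands in for a random hash function.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """For each of num_hashes salted hash functions, keep the minimum hash value."""
    return [min(_salted_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = set(range(0, 100))
    b = set(range(50, 150))
    print("exact:   ", jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Two sets agree in a given signature position with probability equal to their Jaccard similarity, so the estimate concentrates around the exact value as the number of hash functions grows.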
Related papers
- Set Similarity Search Beyond Minhash (2016) · 10.74
- Locality Sensitive Hashing For Set-queries, Motivated By Group Recommendations (2020) · 0.00
- Optimal Las Vegas Locality Sensitive Data Structures (2017) · 6.77
- Efficient Similarity Search In Dynamic Data Streams (2016) · 0.00
- A Memory-efficient Sketch Method For Estimating High Similarities In Streaming Sets (2019) · 12.02
- Superminhash - A New Minwise Hashing Algorithm For Jaccard Similarity Estimation (2017) · 0.00
- Fast Similarity Sketching (2017) · 9.41
- Improving Similarity Search With High-dimensional Locality-sensitive Hashing (2018) · 0.00