Massively-parallel Similarity Join, Edge-isoperimetry, And Distance Correlations On The Hypercube
2016 Β· Paul Beame, Cyrus Rashtchian
Abstract
We study distributed protocols for finding all pairs of similar vectors in a large dataset. Our results pertain to a variety of discrete metrics, and we give concrete instantiations for Hamming distance. In particular, we give improved upper bounds on the overhead required for similarity defined by Hamming distance \(r>1\) and prove a lower bound showing qualitative optimality of the overhead required for similarity over any Hamming distance \(r\). Our main conceptual contribution is a connection between similarity search algorithms and certain graph-theoretic quantities. For our upper bounds, we exhibit a general method for designing one-round protocols using edge-isoperimetric shapes in similarity graphs. For our lower bounds, we define a new combinatorial optimization problem, which can be stated in purely graph-theoretic terms yet also captures the core of the analysis in previous theoretical work on distributed similarity joins. As one of our main technical results, we prove new b
Authors
(none)
Tags
Stats
Related papers
- Dynamic Enumeration Of Similarity Joins (2021)0.00
- Improving Distributed Similarity Join In Metric Space With Error-bounded Sampling (2019)0.00
- Embassi: Embedding Assignment Costs For Similarity Search In Large Graph Databases (2021)2.26
- Subsets And Supermajorities: Optimal Hashing-based Set Similarity Search (2019)5.84
- When Hashes Met Wedges: A Distributed Algorithm For Finding High Similarity Vectors (2017)5.84
- Lsf-join: Locality Sensitive Filtering For Distributed All-pairs Set Similarity Under Skew (2020)6.34
- Unconventional Application Of K-means For Distributed Approximate Similarity Search (2022)5.84
- Hilbert Exclusion: Improved Metric Search Through Finite Isometric Embeddings (2016)10.07