Abstract

We present FLASH (\textbf\{F\}ast \textbf\{L\}SH \textbf\{A\}lgorithm for \textbf\{S\}imilarity search accelerated with \textbf\{H\}PC), a similarity search system for ultra-high dimensional datasets on a single machine, that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging a LSH style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count based estimations, we reduce the computational and parallelization costs of similarity search, while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the prese

Authors

(none)

Tags

  • ANN Search

Stats

  • citations16
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score9.23
  • arxiv keywang2017flash

Related papers