Scalable K-means Clustering For Large K Via Seeded Approximate Nearest-neighbor Search
2025 Β· Jack Spalding-Jamieson, Eliot Wong Robson, da Wei Zheng
Abstract
For very large values of \(k\), we consider methods for fast \(k\)-means clustering of massive datasets with \(10^7\sim10^9\) points in high-dimensions (\(d\geq100\)). All current practical methods for this problem have runtimes at least \(Ξ©(k^2)\). We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
Authors
(none)
Tags
Stats
Related papers
- Unconventional Application Of K-means For Distributed Approximate Similarity Search (2022)5.84
- Hd-index: Pushing The Scalability-accuracy Boundary For Approximate Knn Search In High-dimensional Spaces (2018)14.02
- Let Them Have CAKES: A Cutting-edge Algorithm For Scalable, Efficient, And Exact Search On Big Data (2023)2.68
- A Learning-to-rank Formulation Of Clustering-based Approximate Nearest Neighbor Search (2024)4.52
- PECANN: Parallel Efficient Clustering With Graph-based Approximate Nearest Neighbor Search (2023)0.00
- Adaptive Estimation For Approximate K-nearest-neighbor Computations (2019)0.00
- A Scalable Solution To The Nearest Neighbor Search Problem Through Local-search Methods On Neighbor Graphs (2017)3.58
- Refining A \(k\)-nearest Neighbor Graph For A Computationally Efficient Spectral Clustering (2023)12.25