LGCGMLFeb 10, 2025

Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search

arXiv:2502.06163v12 citationsh-index: 4
Originality Incremental advance
AI Analysis

This addresses scalability issues in clustering for big data applications, but it is incremental as it builds on existing approximate nearest-neighbor methods.

The paper tackles the problem of fast k-means clustering for massive datasets with very large k, where existing methods have runtimes at least Ω(k^2), and proposes a solution using Seeded Approximate Nearest-Neighbor Search, achieving practical improvements in speed.

For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7\sim10^9$ points in high-dimensions ($d\geq100$). All current practical methods for this problem have runtimes at least $Ω(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes