DSLGFeb 4, 2025

A New Rejection Sampling Approach to $k$-$\mathtt{means}$++ With Improved Trade-Offs

arXiv:2502.02085v1h-index: 13
Originality Incremental advance
AI Analysis

This work addresses the scalability issue of k-means clustering for large datasets, offering a faster method with provable guarantees, though it is incremental as it builds on existing seeding techniques.

The authors tackled the high computational cost of the k-means++ seeding algorithm by proposing a rejection sampling approach that speeds it up, achieving a runtime of O~(nnz(X) + βk^2d) while maintaining O(log k) competitiveness in expectation, with empirical validation on real datasets.

The $k$-$\mathtt{means}$++ seeding algorithm (Arthur & Vassilvitskii, 2007) is widely used in practice for the $k$-means clustering problem where the goal is to cluster a dataset $\mathcal{X} \subset \mathbb{R} ^d$ into $k$ clusters. The popularity of this algorithm is due to its simplicity and provable guarantee of being $O(\log k)$ competitive with the optimal solution in expectation. However, its running time is $O(|\mathcal{X}|kd)$, making it expensive for large datasets. In this work, we present a simple and effective rejection sampling based approach for speeding up $k$-$\mathtt{means}$++. Our first method runs in time $\tilde{O}(\mathtt{nnz} (\mathcal{X}) + βk^2d)$ while still being $O(\log k )$ competitive in expectation. Here, $β$ is a parameter which is the ratio of the variance of the dataset to the optimal $k$-$\mathtt{means}$ cost in expectation and $\tilde{O}$ hides logarithmic factors in $k$ and $|\mathcal{X}|$. Our second method presents a new trade-off between computational cost and solution quality. It incurs an additional scale-invariant factor of $ k^{-Ω( m/β)} \operatorname{Var} (\mathcal{X})$ in addition to the $O(\log k)$ guarantee of $k$-$\mathtt{means}$++ improving upon a result of (Bachem et al, 2016a) who get an additional factor of $m^{-1}\operatorname{Var}(\mathcal{X})$ while still running in time $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + mk^2d)$. We perform extensive empirical evaluations to validate our theoretical results and to show the effectiveness of our approach on real datasets.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes