LG DS Dec 22, 2020

Fast and Accurate $k$-means++ via Rejection Sampling

Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson

arXiv:2012.11891v1 7.925 citations

Originality: Incremental advance

AI Analysis

This work provides a more efficient $k$-means++ seeding algorithm for practitioners working with large datasets, offering a practical improvement for a widely used clustering method.

The paper introduces a near-linear time algorithm for $k$-means++ seeding, addressing its slowness on large datasets. This new algorithm achieves the same theoretical guarantees as $k$-means++ while being significantly faster empirically and maintaining equivalent solution quality.

$k$-means++ \cite{arthur2007k} is a widely used clustering algorithm that is easy to implement, has nice theoretical guarantees, and shows strong empirical performance. Despite its wide adoption, $k$-means++ can be slow on large datasets, so a natural question has been how to obtain more efficient algorithms with similar guarantees. In this paper, we present a near-linear-time algorithm for $k$-means++ seeding. Interestingly, our algorithm obtains the same theoretical guarantees as $k$-means++ and significantly improves on earlier results on fast $k$-means++ seeding. Moreover, we show empirically that our algorithm is significantly faster than $k$-means++ and obtains solutions of equivalent quality.
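For context, the baseline that the abstract calls slow can be sketched in a few lines. The block below is a minimal pure-Python sketch of the classic $k$-means++ ($D^2$) seeding step, not the rejection-sampling algorithm proposed in the paper. The names `sq_dist` and `kmeanspp_seed` are hypothetical and chosen only for illustration. Each new center triggers a full pass over the data, so seeding costs roughly $O(nkd)$ for $n$ points in $d$ dimensions, which is the cost a near-linear-time method aims to avoid.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Classic k-means++ (D^2) seeding; hypothetical helper, not the paper's method."""
    rng = rng or random.Random()
    # First center: chosen uniformly at random.
    centers = [points[rng.randrange(len(points))]]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample the next center with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        c = points[idx]
        centers.append(c)
        # Full pass over the data to refresh nearest-center distances:
        # this O(n * d) update per center is what makes the seeding slow.
        d2 = [min(old, sq_dist(p, c)) for old, p in zip(d2, points)]
    return centers
```

A call such as `kmeanspp_seed(points, 3, random.Random(0))` returns three of the input points as initial centers. Passing a seeded `random.Random` makes the run reproducible.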
