LGCVOCMLJun 3, 2019

Big-Data Clustering: K-Means or K-Indicators?

arXiv:1906.00938v1
Originality Incremental advance
AI Analysis

This work addresses a scalability problem for practitioners in big data clustering, offering an incremental improvement over existing methods.

The paper tackles the scalability bottleneck of K-means clustering when the number of clusters K grows in big data applications by promoting the K-indicators model and developing an efficient algorithm that avoids randomized initializations. The result shows that using this new algorithm to initialize K-means significantly outperforms standard K-means with state-of-the-art random replications, particularly for large K.

The K-means algorithm is arguably the most popular data clustering method, commonly applied to processed datasets in some "feature spaces", as is in spectral clustering. Highly sensitive to initializations, however, K-means encounters a scalability bottleneck with respect to the number of clusters K as this number grows in big data applications. In this work, we promote a closely related model called K-indicators model and construct an efficient, semi-convex-relaxation algorithm that requires no randomized initializations. We present extensive empirical results to show advantages of the new algorithm when K is large. In particular, using the new algorithm to start the K-means algorithm, without any replication, can significantly outperform the standard K-means with a large number of currently state-of-the-art random replications.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes