LGIRJan 17, 2017

Faster K-Means Cluster Estimation

arXiv:1701.04600v17 citations
Originality Incremental advance
AI Analysis

This work addresses a specific computational inefficiency in clustering algorithms for data scientists and practitioners, offering an incremental improvement to existing K-means variants.

The paper tackles the bottleneck in K-means clustering where each data point's distance to all centroids is computed every iteration, proposing a fast heuristic that predicts cluster membership changes to reduce computations, achieving up to 3 times speed-up with only marginal increase in mean squared error.

There has been considerable work on improving popular clustering algorithm `K-means' in terms of mean squared error (MSE) and speed, both. However, most of the k-means variants tend to compute distance of each data point to each cluster centroid for every iteration. We propose a fast heuristic to overcome this bottleneck with only marginal increase in MSE. We observe that across all iterations of K-means, a data point changes its membership only among a small subset of clusters. Our heuristic predicts such clusters for each data point by looking at nearby clusters after the first iteration of k-means. We augment well known variants of k-means with our heuristic to demonstrate effectiveness of our heuristic. For various synthetic and real-world datasets, our heuristic achieves speed-up of up-to 3 times when compared to efficient variants of k-means.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes