LG · Nov 18, 2015

Seeding K-Means using Method of Moments

arXiv:1511.05933


AI Analysis

This paper addresses the computational bottleneck of K-means++ seeding in data mining applications by offering a more efficient seeding approach.

The paper tackles the problem of costly seeding in K-means clustering for large datasets by proposing a method based on factorizations of higher order moments, achieving a final cost within O(√K) of optimal and requiring only O(1) passes through the data.

K-means is one of the most widely used clustering algorithms in data mining. It attempts to minimize the sum of squared Euclidean distances between the points in each cluster and that cluster's mean. However, K-means suffers from the local minima problem and is not guaranteed to converge to the optimal cost. K-means++ tries to address this problem by seeding the means using a distance-based sampling scheme. However, seeding the means in K-means++ needs $O(K)$ sequential passes through the entire dataset, which can be very costly for large datasets. Here we propose a method of seeding the initial means based on factorizations of higher-order moments for bounded data. Our method takes $O(1)$ passes through the entire dataset to extract the initial set of means, and its final cost can be proven to be within $O(\sqrt{K})$ of the optimal cost. We demonstrate the performance of our algorithm in comparison with existing algorithms on various benchmark datasets.
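To make the cost of the baseline concrete, the following is a minimal sketch of the K-means objective and of distance-based (D²) seeding in the style of K-means++. It is not the moment-based method proposed in the paper, whose details the abstract does not give. The function names (`sq_dist`, `kmeans_cost`, `kmeanspp_seed`) and the `passes` counter are hypothetical, introduced only to show why each added mean requires another sweep over the data.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, means):
    """K-means objective: sum of squared distances to the nearest mean."""
    return sum(min(sq_dist(p, m) for m in means) for p in points)


def kmeanspp_seed(points, k, rng=None):
    """D^2 seeding sketch (illustrative, not the paper's method).

    Returns the chosen means and the number of full passes over the data.
    """
    rng = rng or random.Random(0)
    means = [list(rng.choice(points))]
    # d2[i] is the squared distance from points[i] to its nearest chosen mean.
    d2 = [sq_dist(p, means[0]) for p in points]
    passes = 1
    while len(means) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen mean: fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample the next mean with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        means.append(list(points[idx]))
        # Sequential pass: every distance is refreshed against the new mean,
        # so the next draw cannot start until this sweep has finished.
        d2 = [min(old, sq_dist(p, means[-1])) for old, p in zip(d2, points)]
        passes += 1
    return means, passes


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.1)]
    seeds, n_passes = kmeanspp_seed(data, k=3)
    print("passes over the data:", n_passes)
    print("seeding cost:", kmeans_cost(data, seeds))
```

In this sketch the number of passes equals K, because each new mean depends on distances computed against all previously chosen means. That dependency is the sequential bottleneck the abstract refers to.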
