ML, May 22, 2017

Improved Clustering with Augmented k-means

arXiv:1705.07592v1, 4 citations
Originality: Incremental advance
AI Analysis

This is an incremental improvement for unsupervised clustering tasks, particularly useful for real datasets with characteristics like heterogeneity and overlap.

The paper tackles the problem of clustering heterogeneous datasets by developing Augmented k-means, a hybrid of k-means and logistic regression that excludes uncertain observations from the re-estimation of cluster means. Compared with standard k-means, it frequently classifies observations more accurately and/or converges in fewer iterations.

Identifying a set of homogeneous clusters in a heterogeneous dataset is one of the most important classes of problems in statistical modeling. In the realm of unsupervised partitional clustering, k-means is a very important algorithm for this task. In this technical report, we develop a new k-means variant called Augmented k-means, a hybrid of k-means and logistic regression. During each iteration, logistic regression is used to predict the current cluster labels, and the resulting cluster membership probabilities control the subsequent re-estimation of cluster means. Observations that cannot be firmly assigned to a cluster are excluded from the re-estimation step. This can be valuable when the data exhibit characteristics common in real datasets, such as heterogeneity, non-sphericity, substantial overlap, and high scatter. Augmented k-means frequently outperforms k-means by classifying observations into known clusters more accurately, by converging in fewer iterations, or both. We demonstrate this on both simulated and real datasets. Our algorithm is implemented in Python and will be available with this report.
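The iteration described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not say how an observation is judged to be "firmly identified", so the probability cutoff (`threshold`), the fallback for clusters with no confident members, the gradient-descent logistic fit, and all function and parameter names here are assumptions made for illustration.

```python
import numpy as np


def _softmax(Z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)


def membership_probs(X, labels, k, steps=200):
    """Fit multinomial logistic regression to the current cluster labels
    by plain gradient descent and return an (n, k) probability matrix.
    Hypothetical helper; the paper's fitting procedure may differ."""
    n, d = X.shape
    # Standardize features so a fixed step size behaves reasonably.
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    Xb = np.hstack([Xs, np.ones((n, 1))])  # append intercept column
    Y = np.eye(k)[labels]                  # one-hot targets
    W = np.zeros((d + 1, k))
    lr = 1.0 / (d + 1)                     # conservative step size
    for _ in range(steps):
        W -= lr * Xb.T @ (_softmax(Xb @ W) - Y) / n
    return _softmax(Xb @ W)


def augmented_kmeans(X, k, threshold=0.9, max_iter=50, seed=0):
    """Sketch of k-means with a logistic-regression filter on the update.
    `threshold` is an assumed confidence cutoff, not a value from the paper."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Predict the current labels and keep only confident observations.
        probs = membership_probs(X, labels, k)
        confident = probs[np.arange(len(X)), labels] >= threshold
        # Update step: re-estimate each mean from confident members only.
        for j in range(k):
            mask = (labels == j) & confident
            if not mask.any():
                mask = labels == j  # assumed fallback: use all members
            if mask.any():
                centers[j] = X[mask].mean(axis=0)
    return labels, centers
```

Calling `augmented_kmeans(X, k=2)` on an `(n, d)` float array returns the final labels and cluster means. Setting `threshold=0.0` keeps every observation in the update step, which reduces the sketch to ordinary k-means and gives a baseline for comparison.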

