ML · Aug 26, 2016

Estimating the Number of Clusters via Normalized Cluster Instability

arXiv:1608.07494v4 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses a specific problem in cluster analysis, selecting the number of clusters, and offers an incremental improvement over existing instability-based methods.

The paper addresses how to select the number of clusters $k$ in cluster analysis. It develops a normalized cluster instability measure that corrects for the distribution of cluster sizes, and reports that this measure outperforms current instability-based measures across the whole sequence of possible $k$, especially for large $k$. It also finds that model-based and model-free approaches to determining cluster instability perform comparably.

We improve current instability-based methods for the selection of the number of clusters $k$ in cluster analysis by developing a normalized cluster instability measure that corrects for the distribution of cluster sizes, a previously unaccounted driver of cluster instability. We show that our normalized instability measure outperforms current instability-based measures across the whole sequence of possible $k$ and especially overcomes limitations in the context of large $k$. We also compare, for the first time, model-based and model-free approaches to determine cluster instability and find their performance to be comparable. We make our method available in the R package cstab.
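To make the idea of correcting instability for cluster sizes concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's definition and not the cstab API: the function names are hypothetical, instability is taken to be the share of object pairs that are co-clustered in one labeling but not in the other, and the normalization divides by the value expected when the two labelings are independent but keep their cluster sizes.

```python
from collections import Counter
from itertools import combinations
from math import comb


def pair_instability(labels_a, labels_b):
    """Share of object pairs co-clustered in one labeling but not the other."""
    n = len(labels_a)
    disagreements = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in combinations(range(n), 2)
    )
    return disagreements / comb(n, 2)


def co_cluster_rate(labels):
    """Share of pairs falling in the same cluster; depends only on cluster sizes."""
    n = len(labels)
    return sum(comb(size, 2) for size in Counter(labels).values()) / comb(n, 2)


def expected_instability(labels_a, labels_b):
    """Expected pair disagreement if the labelings were independent
    given their cluster sizes (assumed normalizer, not taken from the paper)."""
    p_a = co_cluster_rate(labels_a)
    p_b = co_cluster_rate(labels_b)
    return p_a * (1 - p_b) + p_b * (1 - p_a)


def normalized_instability(labels_a, labels_b):
    """Raw instability divided by its size-driven expectation."""
    expected = expected_instability(labels_a, labels_b)
    if expected == 0:
        # All pairs are forced to agree (e.g. one cluster in both labelings).
        return 0.0
    return pair_instability(labels_a, labels_b) / expected


# Two hypothetical labelings of the same six objects.
a = [0, 0, 0, 1, 1, 1]
b = [0, 0, 1, 1, 1, 1]
raw = pair_instability(a, b)
norm = normalized_instability(a, b)
```

In an instability-based workflow the two labelings would typically come from clustering two resampled data sets and comparing the objects they share, with the score averaged over many resamples for each candidate $k$. The sketch covers only the comparison step.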

Code Implementations: 2 repos
