Number of Clusters in a Dataset: A Regularized K-means Approach
This work addresses a critical hyperparameter selection issue in clustering for data analysis applications, but it is incremental as it builds on existing regularized k-means approaches.
The paper tackles the problem of determining the number of clusters in unlabeled datasets by analyzing regularized k-means algorithms, deriving rigorous bounds for the hyperparameter λ under ideal cluster assumptions and showing that combining additive and multiplicative regularizations reduces solution ambiguity in certain cases.
Finding the number of meaningful clusters in an unlabeled dataset is important in many applications. Regularized k-means algorithm is a possible approach frequently used to find the correct number of distinct clusters in datasets. The most common formulation of the regularization function is the additive linear term $λk$, where $k$ is the number of clusters and $λ$ a positive coefficient. Currently, there are no principled guidelines for setting a value for the critical hyperparameter $λ$. In this paper, we derive rigorous bounds for $λ$ assuming clusters are {\em ideal}. Ideal clusters (defined as $d$-dimensional spheres with identical radii) are close proxies for k-means clusters ($d$-dimensional spherically symmetric distributions with identical standard deviations). Experiments show that the k-means algorithm with additive regularizer often yields multiple solutions. Thus, we also analyze k-means algorithm with multiplicative regularizer. The consensus among k-means solutions with additive and multiplicative regularizations reduces the ambiguity of multiple solutions in certain cases. We also present selected experiments that demonstrate performance of the regularized k-means algorithms as clusters deviate from the ideal assumption.