Algorithms for finding $k$ in $k$-means
This work provides a foundational solution for researchers and practitioners in clustering and mixture modeling by automating the selection of $k$, which was previously a manual or heuristic process.
This paper addresses the problem of finding the optimal number of clusters, $k$, in $k$-means clustering, a long-standing open problem. The authors propose a data-determined definition of $k$ and a polynomial-time algorithm to identify it, under the assumptions of a Ground Truth clustering with separated centers and a novel "no tight sub-cluster" (NTSC) constraint. Their algorithm can also find the number of components in mixtures of Gaussians and sub-Gaussian probability density functions.
$k-$means Clustering requires as input the exact value of $k$, the number of clusters. Two challenges are open: (i) Is there a data-determined definition of $k$ which is provably correct and (ii) Is there a polynomial time algorithm to find $k$ from data ? This paper provides the first affirmative answers to both these questions. As common in the literature, we assume that the data admits an unknown Ground Truth (GT) clustering with cluster centers separated. This assumption alone is not sufficient to answer Yes to (i). We assume a novel, but natural second constraint called no tight sub-cluster (NTSC) which stipulates that no substantially large subset of a GT cluster can be "tighter" (in a sense we define) than the cluster. Our yes answer to (i) and (ii) are under these two deterministic assumptions. We also give polynomial time algorithm to identify $k$. Our algorithm relies on NTSC to peel off one cluster at a time by identifying points which are tightly packed. We are also able to show that our algorithm(s) apply to data generated by mixtures of Gaussians and more generally to mixtures of sub-Gaussian pdf's and hence are able to find the number of components of the mixture from data. To our knowledge, previous results for these specialized settings as well, assume generally that $k$ is given besides the data.