Moment-based Uniform Deviation Bounds for $k$-means and Friends
This work addresses theoretical guarantees for clustering algorithms, offering uniform deviation bounds that are incremental but provide new insights for distributions with moment constraints.
The paper tackles the problem of bounding the difference between sample and distribution costs for k-means clustering when fitting k centers to m points from distributions with bounded moments, showing the deviation decays as m^{min{-1/4, -1/2+2/p}}. It also extends this analysis to a soft clustering variant and provides refined rates for structured instances.
Suppose $k$ centers are fit to $m$ points by heuristically minimizing the $k$-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with $p\geq 4$ bounded moments; in particular, the difference between the sample cost and distribution cost decays with $m$ and $p$ as $m^{\min\{-1/4, -1/2+2/p\}}$. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of $k$-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for $k$-means instances possessing some cluster structure.