LG MLApr 13

Distributionally Robust K-Means Clustering

Vikrant Malik, Taylan Kargin, Babak Hassibi

arXiv:2604.111184.0h-index: 6

Predicted impact top 82% in LG · last 90 daysOriginality Incremental advance

AI Analysis

Provides a principled robustness framework for k-means clustering, addressing outliers and distribution shifts for practitioners in unsupervised learning.

K-means clustering is extended to a distributionally robust variant that minimizes worst-case expected squared distance over a Wasserstein-2 ball around the empirical distribution, yielding a soft-clustering algorithm with provable convergence. Experiments show substantial gains in outlier detection and robustness to noise.

K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd--Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, one seeks cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise.

View on arXiv PDF

Similar