DS LGNov 8, 2018

Performance of Johnson-Lindenstrauss Transform for k-Means and k-Medians Clustering

Konstantin Makarychev, Yury Makarychev, Ilya Razenshteyn

arXiv:1811.03195v222.1149 citations

Originality Highly original

AI Analysis

This provides a theoretical guarantee for using dimensionality reduction to speed up clustering algorithms while maintaining accuracy, addressing a core problem in machine learning and data analysis.

The paper shows that projecting Euclidean k-means and k-medians clustering instances onto a random O(log(k/ε)/ε²)-dimensional subspace preserves optimal and all clustering costs within a (1+ε) factor, resolving open problems in the field.

Consider an instance of Euclidean $k$-means or $k$-medians clustering. We show that the cost of the optimal solution is preserved up to a factor of $(1+\varepsilon)$ under a projection onto a random $O(\log(k / \varepsilon) / \varepsilon^2)$-dimensional subspace. Further, the cost of every clustering is preserved within $(1+\varepsilon)$. More generally, our result applies to any dimension reduction map satisfying a mild sub-Gaussian-tail condition. Our bound on the dimension is nearly optimal. Additionally, our result applies to Euclidean $k$-clustering with the distances raised to the $p$-th power for any constant $p$. For $k$-means, our result resolves an open problem posed by Cohen, Elder, Musco, Musco, and Persu (STOC 2015); for $k$-medians, it answers a question raised by Kannan.

View on arXiv PDF

Similar