On the optimality of kernels for high-dimensional clustering
This work provides theoretical guarantees for kernel clustering in high-dimensional data, addressing a gap in understanding optimality for practitioners in machine learning and statistics, though it is incremental as it builds on existing limits and bounds.
This paper tackles the problem of high-dimensional Gaussian clustering by analyzing the optimality of kernel methods, specifically kernel k-means, and shows that with an exponential kernel, the sufficient conditions for partial recovery match the information-theoretic limit up to a factor of √2 for large k and align with known upper bounds for non-kernel settings.
This paper studies the optimality of kernel methods in high-dimensional data clustering. Recent works have studied the large sample performance of kernel clustering in the high-dimensional regime, where Euclidean distance becomes less informative. However, it is unknown whether popular methods, such as kernel k-means, are optimal in this regime. We consider the problem of high-dimensional Gaussian clustering and show that, with the exponential kernel function, the sufficient conditions for partial recovery of clusters using the NP-hard kernel k-means objective matches the known information-theoretic limit up to a factor of $\sqrt{2}$ for large $k$. It also exactly matches the known upper bounds for the non-kernel setting. We also show that a semi-definite relaxation of the kernel k-means procedure matches up to constant factors, the spectral threshold, below which no polynomial-time algorithm is known to succeed. This is the first work that provides such optimality guarantees for the kernel k-means as well as its convex relaxation. Our proofs demonstrate the utility of the less known polynomial concentration results for random variables with exponentially decaying tails in a higher-order analysis of kernel methods.