STCCLGMLMay 21, 2020

Computationally efficient sparse clustering

arXiv:2005.10817v318 citations
Originality Incremental advance
AI Analysis

This addresses computational efficiency in clustering for high-dimensional statistics, providing theoretical insights into necessary conditions for polynomial-time algorithms, though it is incremental in extending existing lower bounds.

The paper tackles the problem of clustering high-dimensional data with sparse cluster centers, showing that a new algorithm based on sparse PCA achieves the minimax optimal misclustering rate as the signal strength increases, with sparsity growing slower than the square root of the sample size.

We study statistical and computational limits of clustering when the means of the centres are sparse and their dimension is possibly much larger than the sample size. Our theoretical analysis focuses on the model $X_i = z_i θ+ \varepsilon_i, ~z_i \in \{-1,1\}, ~\varepsilon_i \thicksim \mathcal{N}(0,I)$, which has two clusters with centres $θ$ and $-θ$. We provide a finite sample analysis of a new sparse clustering algorithm based on sparse PCA and show that it achieves the minimax optimal misclustering rate in the regime $\|θ\| \rightarrow \infty$. Our results require the sparsity to grow slower than the square root of the sample size. Using a recent framework for computational lower bounds -- the low-degree likelihood ratio -- we give evidence that this condition is necessary for any polynomial-time clustering algorithm to succeed below the BBP threshold. This complements existing evidence based on reductions and statistical query lower bounds. Compared to these existing results, we cover a wider set of parameter regimes and give a more precise understanding of the runtime required and the misclustering error achievable. Our results imply that a large class of tests based on low-degree polynomials fail to solve even the weak testing task.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes