List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering
This addresses robust estimation in high-dimensional sparse settings, which is crucial for applications like outlier detection in data analysis, but it is incremental as it builds on prior work in list-decodable mean estimation by extending it to the sparse case.
The paper tackles the problem of list-decodable sparse mean estimation, where a dataset contains a minority of inliers from a distribution with a sparse mean and a majority of arbitrary outliers, and it provides the first sample and computationally efficient algorithm achieving error bounds such as (1/α)^{O(1/t)} with sample complexity (k log(n))^{O(t)}/α, and optimal error Θ(√log(1/α)) for Gaussian inliers.
We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $α\in (0, 1/2)$, we are given $m$ points in $\mathbb{R}^n$, $\lfloor αm \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $μ$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\widehat μ$ such that $\| \widehat μ- μ\|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with "certifiably bounded" $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/α)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/α$ and running time $\mathrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $Θ(\sqrt{\log(1/α)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.