LGFeb 24

High-Dimensional Robust Mean Estimation with Untrusted Batches

Maryam Aliakbarpour, Vladimir Braverman, Yuhan Liu, Junze Yin

arXiv:2602.20698v1h-index: 8

Originality Highly original

AI Analysis

This addresses robust statistical estimation for collaborative learning systems where data sources may be untrusted or heterogeneous, with incremental improvements over prior work on untrusted batch models.

The paper tackles high-dimensional mean estimation in a collaborative setting where data comes from multiple users in batches, with some users being adversarial and others providing statistically heterogeneous data. They develop Sum-of-Squares algorithms that achieve minimax-optimal error rates of O(√(ε/n) + √(d/nN) + √α), showing that batch structure reduces adversarial influence by a factor of 1/√n.

We study high-dimensional mean estimation in a collaborative setting where data is contributed by $N$ users in batches of size $n$. In this environment, a learner seeks to recover the mean $μ$ of a true distribution $P$ from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: an $\varepsilon$-fraction of users are entirely adversarial, while the remaining ``good'' users provide data from distributions that are related to $P$, but deviate by a proximity parameter $α$. Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants for deviation: (1) good batches are drawn from distributions with a mean-shift of $\sqrtα$, or (2) an $α$-fraction of samples within each good batch are adversarially corrupted. In particular, the second model presents significant new challenges: in high dimensions, unlike discrete settings, even a small fraction of sample-level corruption can shift empirical means and covariances arbitrarily. We provide two Sum-of-Squares (SoS) based algorithms to navigate this tiered corruption. Our algorithms achieve the minimax-optimal error rate $O(\sqrt{\varepsilon/n} + \sqrt{d/nN} + \sqrtα)$, demonstrating that while heterogeneity $α$ represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of $1/\sqrt{n}$ due to the internal averaging afforded by the batch structure.

View on arXiv PDF

Similar