Streaming Algorithms for High-Dimensional Robust Statistics
This work addresses the memory bottleneck for robust statistics in streaming settings, which is incremental but important for applications requiring real-time data processing.
The paper tackles the problem of high-dimensional robust statistics in the streaming model, where previous algorithms required storing entire datasets with quadratic memory. It presents the first efficient streaming algorithms with near-optimal memory, achieving nearly-linear space complexity for robust mean estimation and extending to tasks like robust covariance estimation and regression.
We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization.