Fast and Robust Least Squares Estimation in Corrupted Linear Models
This work addresses the need for efficient and robust regression methods in large-scale data analysis, particularly in domains prone to outliers. It represents an incremental advance that combines subsampling with influence-based robustness.
The paper tackles the problem of speeding up least squares estimation in large-scale linear models while ensuring robustness to corrupted observations, achieving improvements over state-of-the-art approximation schemes as shown theoretically and empirically on simulated and real datasets.
Subsampling methods have recently been proposed to speed up least squares estimation in large-scale settings. However, these algorithms are typically not robust to outliers or corruptions in the observed covariates. The concept of influence, which was developed for regression diagnostics, can be used to detect such corrupted observations, as shown in this paper. This property of influence -- for which we also develop a randomized approximation -- motivates our proposed subsampling algorithm for large-scale corrupted linear regression, which limits the influence of data points, since highly influential points contribute most to the residual error. Under a general model of corrupted observations, we show theoretically and empirically on a variety of simulated and real datasets that our algorithm improves over the current state-of-the-art approximation schemes for ordinary least squares.
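The idea of limiting influence through subsampling can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes the classical regression-diagnostics form of influence, d_i = e_i^2 h_i / (1 - h_i)^2, where e_i is the OLS residual and h_i the leverage of point i, and it assumes sampling probabilities inversely proportional to d_i. It also computes influence exactly from a full fit, so it shows no speed-up, whereas the abstract refers to a randomized approximation. The function names `influence_scores` and `influence_subsampled_ols` are hypothetical.

```python
import numpy as np


def influence_scores(X, y):
    """Exact influence d_i = e_i^2 * h_i / (1 - h_i)^2 (assumed definition)."""
    # Leverage h_i is the squared row norm of Q from a reduced QR of X.
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q ** 2, axis=1)
    # Keep 1 - h_i strictly positive so the ratio stays finite.
    leverage = np.clip(leverage, 0.0, 1.0 - 1e-12)
    # Residuals of the full OLS fit.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid ** 2 * leverage / (1.0 - leverage) ** 2


def influence_subsampled_ols(X, y, n_sub, rng):
    """Fit OLS on n_sub rows drawn with probability proportional to 1 / d_i.

    The inverse-influence sampling rule is an assumption of this sketch.
    """
    d = influence_scores(X, y)
    weights = 1.0 / (d + 1e-12)  # small constant avoids division by zero
    probs = weights / weights.sum()
    idx = rng.choice(len(y), size=n_sub, replace=False, p=probs)
    beta_sub, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta_sub, idx
```

Under these assumptions, rows whose covariates were corrupted tend to receive large influence scores and are therefore rarely drawn into the subsample, so the fit on the sampled rows is driven mostly by clean observations.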