Debiasing Random Oblique Projections for Subsampled OLS and Fast CUR in High Dimensions
For practitioners using random sampling in large-scale least squares and low-rank approximation, this work reveals hidden bias and provides a principled correction, improving accuracy.
The paper identifies and corrects systematic statistical bias in random oblique projections induced by sampling, which is overlooked by standard subspace embedding analyses. For subsampled least squares, it provides sharp bias-variance characterizations and shows debiasing yields provable improvements; for fast CUR decomposition, it develops a debiased approach with improved approximation accuracy.
Random sampling is a fundamental tool in modern machine learning and numerical linear algebra for reducing the computational cost of large-scale matrix problems. Existing analyses, however, rely primarily on subspace embedding guarantees, which do not precisely characterize the statistical bias of the nonlinear random oblique projections induced by sampling. Such projections arise ubiquitously in subsampled least squares and fast low-rank approximation methods. Because (pseudo)inversion is nonlinear, these random oblique projections can be systematically biased even when the underlying sketch is unbiased, thereby introducing hidden bias into downstream least squares and low-rank approximation solutions. In this work, we develop a unified non-asymptotic theory for random oblique projections in high dimensions. We show that standard random sampling schemes generally induce a systematic statistical bias overlooked by classical subspace embedding-style analyses, and we propose a principled debiasing framework to correct it. We illustrate the power of the theory through two canonical applications. For subsampled least squares, we obtain sharp bias-variance characterizations, reveal previously unrecognized statistical suboptimality in widely used sampling schemes, and identify when debiasing yields provable improvements. For fast CUR decomposition, we develop a debiased approach with improved approximation accuracy. Numerical experiments further validate our theoretical findings.
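The claim that an unbiased sketch can still yield a biased solution can be checked on a toy problem. The sketch below is only an illustration under stated assumptions, not the paper's sampling scheme or its debiasing method: it assumes uniform row sampling with replacement, rescaled by sqrt(n/m), and it enumerates every possible sample so the expectations are exact rather than Monte Carlo estimates. The function name `exact_expectations` and the 3-by-1 example data are hypothetical.

```python
import itertools

import numpy as np


def exact_expectations(A, b, m):
    """Exact E[S^T S] and E[(SA)^+ S b] under uniform sampling with replacement.

    Assumption for this sketch: S has m rows, each equal to sqrt(n/m) * e_i^T
    with i drawn uniformly from {0, ..., n-1}. All n**m outcomes are equally
    likely, so averaging over them gives the exact expectation.
    """
    n, d = A.shape
    scale = np.sqrt(n / m)
    mean_gram = np.zeros((n, n))
    mean_x = np.zeros(d)
    count = 0
    for idx in itertools.product(range(n), repeat=m):
        S = np.zeros((m, n))
        S[np.arange(m), list(idx)] = scale
        mean_gram += S.T @ S
        # Subsampled least squares solution via the pseudoinverse of SA.
        mean_x += np.linalg.pinv(S @ A) @ (S @ b)
        count += 1
    return mean_gram / count, mean_x / count


# Hypothetical toy data: n = 3 observations, d = 1 feature, m = 1 sampled row.
A = np.array([[1.0], [2.0], [3.0]])
b = np.array([1.0, 1.0, 1.0])

gram, x_sub = exact_expectations(A, b, m=1)
x_ols = np.linalg.lstsq(A, b, rcond=None)[0]

print("E[S^T S] equals identity:", np.allclose(gram, np.eye(3)))
print("full OLS solution:       ", x_ols)   # 3/7, about 0.4286
print("E[subsampled solution]:  ", x_sub)   # 11/18, about 0.6111
```

In this example the sketch satisfies E[S^T S] = I, yet the expected subsampled solution is the average of b_i / a_i (11/18) rather than the full least squares solution a^T b / a^T a (3/7). The gap comes entirely from taking the expectation of a pseudoinverse, which is the nonlinearity the abstract refers to.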