Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression
This work addresses uncertainty estimation, a problem relevant to both statisticians and machine learning practitioners, and highlights its limitations in high-dimensional settings. The contribution is incremental, however, since it builds on existing asymptotic theory.
The paper analyzes resampling methods such as bootstrap and subsampling for uncertainty estimation in high-dimensional regularized regression. It finds that they are unreliable in over-parametrized regimes and become consistent only when the sample-to-dimension ratio is sufficiently large, and it provides the corresponding convergence rates.
We investigate popular resampling methods, such as subsampling, bootstrap and the jackknife, for estimating the uncertainty of statistical models, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, taking the limit where the number of samples $n$ and the dimension $d$ of the covariates grow at a comparable rate, with fixed ratio $\alpha = n/d$. Our findings are three-fold: i) resampling methods are fraught with problems in high dimensions and exhibit the double-descent-like behavior typical of these situations; ii) only when $\alpha$ is large enough do they provide consistent and reliable error estimations (we give convergence rates); iii) in the over-parametrized regime $\alpha < 1$ relevant to modern machine learning practice, their predictions are not consistent, even with optimal regularization.
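To make the setting concrete, the following is a minimal finite-size simulation sketch, not the paper's asymptotic analysis. It computes a pairs-bootstrap estimate of the total variance of a ridge estimator at a chosen ratio $\alpha = n/d$ and compares it with a Monte Carlo reference obtained by refitting on fresh datasets. The Gaussian covariates, the linear ground-truth vector `w_star`, the noise level, the regularization strength `lam`, and all function names are illustrative assumptions that do not come from the source.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge estimator: argmin_w ||y - X w||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def bootstrap_variance(X, y, lam, n_boot, rng):
    """Pairs-bootstrap estimate of the total variance of the ridge estimator."""
    n = X.shape[0]
    fits = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        fits.append(ridge_fit(X[idx], y[idx], lam))
    fits = np.asarray(fits)
    # Sum over coordinates of the variance across bootstrap refits.
    return fits.var(axis=0, ddof=1).sum()


def reference_variance(w_star, n, lam, noise_std, n_rep, rng):
    """Monte Carlo reference: variance of the estimator over fresh datasets."""
    d = w_star.shape[0]
    fits = []
    for _ in range(n_rep):
        X = rng.standard_normal((n, d))  # assumed Gaussian covariates
        y = X @ w_star + noise_std * rng.standard_normal(n)
        fits.append(ridge_fit(X, y, lam))
    return np.asarray(fits).var(axis=0, ddof=1).sum()


def compare(alpha, d=50, lam=1.0, noise_std=0.5, n_boot=200, n_rep=200, seed=0):
    """Return (bootstrap estimate, Monte Carlo reference) at ratio alpha = n/d."""
    rng = np.random.default_rng(seed)
    n = max(2, int(round(alpha * d)))
    w_star = rng.standard_normal(d) / np.sqrt(d)  # hypothetical ground truth
    X = rng.standard_normal((n, d))
    y = X @ w_star + noise_std * rng.standard_normal(n)
    boot = bootstrap_variance(X, y, lam, n_boot, rng)
    ref = reference_variance(w_star, n, lam, noise_std, n_rep, rng)
    return boot, ref
```

Calling `compare(0.5)` and `compare(5.0)` gives one way to inspect how the gap between the two quantities changes between an over-parametrized and a sample-rich regime. At these small sizes the numbers are noisy and only loosely indicative of the asymptotic behavior described above.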