ST LG ML
Jun 4, 2019

On the number of variables to use in principal component regression

arXiv:1906.01139v2
19.533 citations

Originality Synthesis-oriented

AI Analysis

The paper offers theoretical insight for statisticians and machine learning practitioners into variable selection in high-dimensional regression, though the contribution is incremental because it builds on existing average-case analysis frameworks.

The paper tackles the problem of selecting the number of variables in principal component regression to minimize out-of-sample prediction error, finding that the error exhibits a 'double descent' shape and that minimum risk can occur in the interpolating regime where the number of features exceeds the sample size.

We study least squares linear regression over $N$ uncorrelated Gaussian features that are selected in order of decreasing variance. When the number of selected features $p$ is at most the sample size $n$, the estimator under consideration coincides with the principal component regression estimator; when $p>n$, the estimator is the least $\ell_2$ norm solution over the selected features. We give an average-case analysis of the out-of-sample prediction error as $p,n,N \to \infty$ with $p/N \to \alpha$ and $n/N \to \beta$, for some constants $\alpha \in [0,1]$ and $\beta \in (0,1)$. In this average-case setting, the prediction error exhibits a "double descent" shape as a function of $p$. We also establish conditions under which the minimum risk is achieved in the interpolating ($p>n$) regime.
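The estimator described in the abstract can be explored numerically. The following is a minimal simulation sketch, not the paper's own experiment: the variance profile `1/j`, the standard normal coefficients, the noise level, and the function names `fit_top_p`, `prediction_risk`, and `simulate` are all assumptions chosen for illustration. It fits least squares on the `p` highest-variance features (using the pseudoinverse, which gives the least $\ell_2$ norm solution when $p>n$) and evaluates the out-of-sample prediction error in closed form, which is possible because the features are uncorrelated.

```python
import numpy as np


def fit_top_p(X, y, p):
    """Least squares on the first p columns of X.

    Columns are assumed to be ordered by decreasing variance. The
    pseudoinverse returns the ordinary least squares fit when the
    selected block has full column rank (typically p <= n) and the
    minimum l2-norm interpolating solution when p > n.
    Coefficients of the unselected features are set to zero.
    """
    theta_hat = np.zeros(X.shape[1])
    if p > 0:
        theta_hat[:p] = np.linalg.pinv(X[:, :p]) @ y
    return theta_hat


def prediction_risk(theta_hat, theta, variances, noise_var):
    """Expected squared error on a fresh (x, y) pair.

    For uncorrelated zero-mean features with the given variances this is
    sum_j variances[j] * (theta[j] - theta_hat[j])**2 + noise_var.
    """
    return float(np.sum(variances * (theta - theta_hat) ** 2) + noise_var)


def simulate(n=40, N=100, noise_var=0.25, seed=0):
    """Risk as a function of p for one random draw (illustrative settings)."""
    rng = np.random.default_rng(seed)
    variances = 1.0 / np.arange(1, N + 1)      # assumed decreasing variance profile
    theta = rng.standard_normal(N)             # assumed random true coefficients
    X = rng.standard_normal((n, N)) * np.sqrt(variances)
    y = X @ theta + np.sqrt(noise_var) * rng.standard_normal(n)
    return np.array([
        prediction_risk(fit_top_p(X, y, p), theta, variances, noise_var)
        for p in range(N + 1)
    ])


if __name__ == "__main__":
    risks = simulate()
    for p in (0, 20, 40, 60, 100):
        print(f"p={p:3d}  risk={risks[p]:.3f}")
```

Plotting `risks` against `p` for a few seeds is one way to look for the shape the abstract describes, with any peak expected near `p = n`; a single finite-size draw is noisy, so averaging over seeds gives a closer analogue of the average-case setting.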
