ML · LG · ST · Jun 10, 2020

On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression

arXiv:2006.05800v4 · 148 citations
Originality: Incremental advance
AI Analysis

This work provides theoretical insights for statisticians and machine learning researchers on regularization in high-dimensional settings, though it is incremental as it builds on existing asymptotic analyses.

The paper tackles the problem of optimal weighted ridge regression in overparameterized linear models, deriving exact conditions for the optimal ridge parameter and weighting matrix, and showing that negative regularization can be optimal, with theoretical justifications for empirical observations.

We consider the linear model $\mathbf{y} = \mathbf{X} \boldsymbol{\beta}_\star + \boldsymbol{\varepsilon}$ with $\mathbf{X}\in \mathbb{R}^{n\times p}$ in the overparameterized regime $p>n$. We estimate $\boldsymbol{\beta}_\star$ via generalized (weighted) ridge regression: $\hat{\boldsymbol{\beta}}_\lambda = \left(\mathbf{X}^T\mathbf{X} + \lambda\boldsymbol{\Sigma}_w\right)^\dagger \mathbf{X}^T\mathbf{y}$, where $\boldsymbol{\Sigma}_w$ is the weighting matrix. Under a random design setting with general data covariance $\boldsymbol{\Sigma}_x$ and anisotropic prior on the true coefficients $\mathbb{E}\boldsymbol{\beta}_\star\boldsymbol{\beta}_\star^T = \boldsymbol{\Sigma}_\beta$, we provide an exact characterization of the prediction risk $\mathbb{E}(y-\mathbf{x}^T\hat{\boldsymbol{\beta}}_\lambda)^2$ in the proportional asymptotic limit $p/n\rightarrow \gamma\in (1,\infty)$. Our general setup leads to a number of interesting findings. We outline precise conditions that decide the sign of the optimal setting $\lambda_{\rm opt}$ for the ridge parameter $\lambda$ and confirm the implicit $\ell_2$ regularization effect of overparameterization, which theoretically justifies the surprising empirical observation that $\lambda_{\rm opt}$ can be negative in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when both $\mathbf{X}$ and $\boldsymbol{\beta}_\star$ are anisotropic. Finally, we determine the optimal weighting matrix $\boldsymbol{\Sigma}_w$ for both the ridgeless ($\lambda\to 0$) and optimally regularized ($\lambda = \lambda_{\rm opt}$) cases, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.
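To make the estimator in the abstract concrete, the following is a minimal NumPy sketch, not the authors' code. It computes the generalized ridge estimate with a pseudo-inverse and evaluates the prediction risk in closed form for a fixed coefficient vector. The function names `weighted_ridge` and `prediction_risk`, the isotropic choices for the data covariance and weighting matrix, the Gaussian design, and the noise level are all assumptions made for illustration; the risk formula additionally assumes zero-mean features and noise independent of the features.

```python
import numpy as np


def weighted_ridge(X, y, lam, Sigma_w):
    """Generalized ridge estimate (X^T X + lam * Sigma_w)^+ X^T y."""
    return np.linalg.pinv(X.T @ X + lam * Sigma_w) @ (X.T @ y)


def prediction_risk(beta_hat, beta_star, Sigma_x, noise_var):
    """E(y - x^T beta_hat)^2 for a fresh sample with zero-mean x, Cov(x) = Sigma_x.

    Expanding y = x^T beta_star + eps gives
    (beta_hat - beta_star)^T Sigma_x (beta_hat - beta_star) + noise_var.
    """
    diff = beta_hat - beta_star
    return float(diff @ Sigma_x @ diff + noise_var)


# Illustrative setup (assumed, not taken from the paper).
rng = np.random.default_rng(0)
n, p = 50, 100                      # overparameterized: p > n, p / n = 2
Sigma_x = np.eye(p)                 # isotropic data covariance
Sigma_w = np.eye(p)                 # identity weighting = standard ridge
noise_std = 0.5
beta_star = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta_star + noise_std * rng.normal(size=n)

# Risk of the estimator for a few positive ridge parameters.
risks = {}
for lam in (0.01, 0.1, 1.0):
    beta_hat = weighted_ridge(X, y, lam, Sigma_w)
    risks[lam] = prediction_risk(beta_hat, beta_star, Sigma_x, noise_std**2)
    print(f"lambda = {lam:5.2f}  risk = {risks[lam]:.4f}")
```

Passing a non-identity positive definite `Sigma_w` gives the weighted variant. For positive `lam` with the identity weighting the matrix is invertible, so the pseudo-inverse coincides with the ordinary inverse; the pseudo-inverse only matters when the matrix becomes singular, as in the ridgeless limit.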

