NALGOCMLJul 27, 2020

On the Regularization Effect of Stochastic Gradient Descent applied to Least Squares

arXiv:2007.13288v2
Originality Incremental advance
AI Analysis

This provides theoretical insight into SGD's implicit regularization for optimization problems, which is incremental but clarifies a known phenomenon in machine learning.

The paper tackles the regularization effect of stochastic gradient descent (SGD) applied to least squares problems, showing that SGD leads to quick regularization by smoothing errors from large to small singular vectors, with explicit convergence bounds depending on the matrix properties.

We study the behavior of stochastic gradient descent applied to $\|Ax -b \|_2^2 \rightarrow \min$ for invertible $A \in \mathbb{R}^{n \times n}$. We show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such that $$ \mathbb{E} ~\left\| Ax_{k+1}-b\right\|^2_{2} \leq \left(1 + \frac{c_{A}}{\|A\|_F^2}\right) \left\|A x_k -b \right\|^2_{2} - \frac{2}{\|A\|_F^2} \left\|A^T A (x_k - x)\right\|^2_{2}.$$ This is a curious inequality: the last term has one more matrix applied to the residual $u_k - u$ than the remaining terms: if $x_k - x$ is mainly comprised of large singular vectors, stochastic gradient descent leads to a quick regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values smoothes.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes