LG · Jun 17, 2021

Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity

arXiv:2106.09524v3 · 126 citations
Originality: Incremental advance
AI Analysis

This paper offers theoretical insight into why SGD can generalize better than gradient descent in practice, a fundamental question in machine learning optimization for both researchers and practitioners.

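The paper studies the implicit bias of stochastic gradient descent (SGD) in overparametrized neural networks by analyzing diagonal linear networks in continuous time (stochastic gradient flow). It proves that the solution selected by the stochastic flow always has better generalization properties than the one selected by gradient flow, and that the strength of this bias is controlled by the convergence speed of the training loss: the slower the convergence, the stronger the beneficial bias.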

Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than that of gradient flow. Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To fully complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the greater performances observed in practice of stochastic gradient descent over gradient descent.
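To make the setting in the abstract concrete, here is a minimal discrete-time sketch, not the authors' code. It assumes the parametrisation beta = w_plus**2 - w_minus**2 (a common form of diagonal linear network), a squared loss, and a uniform initialisation scale `alpha`. The names `train_dln` and `train_loss`, the toy data, and all hyperparameters are hypothetical choices for illustration. The paper itself analyses the continuous-time stochastic gradient flow rather than this discrete loop.

```python
import numpy as np

def train_dln(X, y, alpha=0.1, lr=0.002, steps=20000, batch_size=None, seed=0):
    """Train a diagonal linear network beta = w_plus**2 - w_minus**2 on squared loss.

    batch_size=None uses the full batch (gradient descent);
    an integer batch_size samples a mini-batch at every step (SGD).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_plus = np.full(d, alpha)   # assumed initialisation: every weight equals alpha
    w_minus = np.full(d, alpha)
    for _ in range(steps):
        if batch_size is None:
            idx = np.arange(n)
        else:
            idx = rng.choice(n, size=batch_size, replace=False)
        beta = w_plus**2 - w_minus**2
        residual = X[idx] @ beta - y[idx]
        grad_beta = X[idx].T @ residual / len(idx)  # gradient of the batch loss w.r.t. beta
        # Chain rule: d(beta)/d(w_plus) = 2 * w_plus, d(beta)/d(w_minus) = -2 * w_minus
        w_plus = w_plus - lr * 2.0 * w_plus * grad_beta
        w_minus = w_minus + lr * 2.0 * w_minus * grad_beta
    return w_plus**2 - w_minus**2

def train_loss(X, y, beta):
    """Half mean squared error on the full training set."""
    return 0.5 * np.mean((X @ beta - y) ** 2)

# Toy overparametrised problem (d > n) with a sparse ground truth; purely illustrative.
rng = np.random.default_rng(0)
n, d = 20, 40
beta_star = np.zeros(d)
beta_star[:3] = 1.0
X = rng.standard_normal((n, d))
y = X @ beta_star

beta_gd = train_dln(X, y)                 # full-batch gradient descent
beta_sgd = train_dln(X, y, batch_size=1)  # single-sample SGD

print("GD  train loss:", train_loss(X, y, beta_gd))
print("SGD train loss:", train_loss(X, y, beta_sgd))
print("GD  distance to beta_star:", np.linalg.norm(beta_gd - beta_star))
print("SGD distance to beta_star:", np.linalg.norm(beta_sgd - beta_star))
```

Comparing the two distances to `beta_star` across values of `alpha`, `lr`, and `seed` is one way to probe the claimed effect empirically. A toy run like this does not establish the theorem, and its outcome depends on the hyperparameters chosen.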

