LG | ML | Feb 24, 2021

On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)

arXiv:2102.12470v2 | 102 citations
AI Analysis

This work addresses a foundational issue in machine learning theory for researchers and practitioners, providing a more rigorous basis for understanding SGD dynamics, though it is incremental in refining existing approximations.

The paper tackles the problem of justifying the approximation of finite-learning-rate SGD with stochastic differential equations (SDEs), which is widely used but lacks formal validation. It introduces an efficient simulation algorithm (SVAG) and a testable condition, showing that the SDE approximation can capture training and generalization in deep nets.

It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Itô Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., Li et al., 2019) only applies to SGD with tiny LR. Experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm SVAG that provably converges to the conventionally used Itô SDE approximation. (b) A theoretically motivated testable necessary condition for the SDE approximation and its most famous implication, the linear scaling rule (Goyal et al., 2017), to hold. (c) Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.
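To make contribution (a) concrete, here is a minimal sketch of an SVAG-style update on a toy one-dimensional quadratic loss. The abstract does not spell out the update rule, so the weighting below is an assumption recalled from the paper's description and should be checked against it: each step mixes two independent minibatch gradients with weights (1 ± sqrt(2l − 1))/2 and uses step size LR/l. The names `svag_weights`, `noisy_grad`, and `svag_run`, and the Gaussian gradient noise, are hypothetical and chosen only for illustration.

```python
import math
import random


def svag_weights(l):
    """Mixing weights for two independent minibatch gradients (assumed form).

    They sum to 1, so the expected gradient is unchanged, and their squares
    sum to l, so the gradient-noise covariance is amplified by a factor of l.
    """
    s = math.sqrt(2 * l - 1)
    return (1 + s) / 2, (1 - s) / 2


def noisy_grad(x, rng, sigma=1.0):
    # Toy stochastic gradient of L(x) = x^2 / 2: the true gradient x plus
    # Gaussian noise standing in for minibatch sampling noise.
    return x + sigma * rng.gauss(0.0, 1.0)


def svag_run(x0, lr, l, sgd_steps, seed=0):
    """Run l * sgd_steps SVAG-style steps, each with step size lr / l.

    This covers the same continuous time (lr * sgd_steps) as sgd_steps
    steps of plain SGD with learning rate lr.
    """
    rng = random.Random(seed)
    w1, w2 = svag_weights(l)
    x = x0
    for _ in range(l * sgd_steps):
        g = w1 * noisy_grad(x, rng) + w2 * noisy_grad(x, rng)
        x -= (lr / l) * g
    return x


if __name__ == "__main__":
    for l in (1, 4, 16):
        print(l, svag_run(x0=1.0, lr=0.1, l=l, sgd_steps=100))
```

With l = 1 the weights are (1, 0) and the loop reduces to ordinary SGD. Under the assumed weights, the noise injected per unit of continuous time stays matched to SGD while the step size shrinks, which is the sense in which larger l is meant to approach the SDE; comparing runs across increasing l is then a way to probe whether SGD and the SDE behave alike.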

Code Implementations: 1 repo