LG · DS · NA · OC · ML · Feb 14, 2020

Stochasticity of Deterministic Gradient Descent: Large Learning Rate for Multiscale Objective Function

arXiv:2002.06189v2 · 33 citations
AI Analysis

This paper advances the understanding of optimization dynamics in machine learning by showing that a deterministic method can behave chaotically. The contribution is incremental, but it clarifies a specific regime: large learning rates on multiscale objectives.

The paper studies deterministic Gradient Descent on multiscale objective functions with large learning rates. In this regime GD behaves stochastically and converges to a statistical distribution rather than to a local minimizer. The claim is supported by theoretical and numerical demonstrations, including a sufficient condition under which the limiting distribution is approximated by a rescaled Gibbs distribution.

This article suggests that deterministic Gradient Descent, which does not use any stochastic gradient approximation, can still exhibit stochastic behaviors. In particular, it shows that if the objective function exhibits multiscale behaviors, then in a large learning rate regime which only resolves the macroscopic but not the microscopic details of the objective, the deterministic GD dynamics can become chaotic and converge not to a local minimizer but to a statistical distribution. A sufficient condition is also established for approximating this long-time statistical limit by a rescaled Gibbs distribution. Both theoretical and numerical demonstrations are provided, and the theoretical part relies on the construction of a stochastic map that uses bounded noise (as opposed to discretized diffusions).
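
The regime described in the abstract can be illustrated with a minimal sketch. The toy objective f(x) = x^2/2 + eps*cos(x/eps), the value of eps, the two learning rates, and the helper name gd_tail are all assumptions chosen for illustration; they are not taken from the paper. The sketch runs plain GD with a learning rate much larger than eps and again with one much smaller than eps, then summarizes the late iterates of each run.

```python
import math
import statistics

EPS = 1e-3  # microscopic scale of the toy objective (assumed value)


def grad(x, eps=EPS):
    # Toy multiscale objective (an assumption, not the paper's example):
    #   f(x) = x**2 / 2 + eps * cos(x / eps)
    # The small-scale term is tiny in value but has an O(1) gradient:
    #   f'(x) = x - sin(x / eps)
    return x - math.sin(x / eps)


def gd_tail(h, x0=1.0, burn_in=10_000, keep=20_000):
    """Run deterministic GD with learning rate h and return the late iterates."""
    x = x0
    tail = []
    for k in range(burn_in + keep):
        x = x - h * grad(x)  # plain GD step, no randomness anywhere
        if k >= burn_in:
            tail.append(x)
    return tail


# Large learning rate (h >> eps): resolves only the macroscopic part x**2 / 2.
large = gd_tail(h=0.1)
# Small learning rate (h << eps): resolves the microscopic oscillations as well.
small = gd_tail(h=1e-4)

print("h=0.1  mean, std of late iterates:",
      statistics.mean(large), statistics.pstdev(large))
print("h=1e-4 mean, std of late iterates:",
      statistics.mean(small), statistics.pstdev(small))
```

In this toy setting the small-rate run should settle at a local minimizer close to its starting point, so the spread of its late iterates is essentially zero. The large-rate run should keep moving within a bounded region around the macroscopic minimizer at 0, so its late iterates are better summarized by a mean and a spread than by a single limit point. Rerunning gd_tail with the same arguments reproduces the same iterates, because the dynamics are fully deterministic.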

