LG ML Feb 26, 2021

Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability

Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar

arXiv:2103.00065v342.3437 citations Has Code

Originality: Incremental advance

AI Analysis

This paper addresses a foundational problem in machine learning optimization by revealing unexpected behavior that could affect all neural network training methods. It is incremental in that it provides empirical evidence rather than a new paradigm.

The paper studies how gradient descent behaves during neural network training. It shows empirically that full-batch gradient descent typically operates at the Edge of Stability, where the maximum eigenvalue of the training loss Hessian hovers just above $2 / \text{(step size)}$ and the training loss is non-monotonic over short timescales yet decreases over long timescales. This behavior challenges several widespread presumptions in optimization.

We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https://github.com/locuslab/edge-of-stability.
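To see why $2 / \text{(step size)}$ is the natural threshold, consider a minimal sketch on a one-dimensional quadratic. This is an illustration only, not the paper's experiments or code: the function `gd_on_quadratic` and all numeric values are hypothetical. For $f(x) = \tfrac{1}{2} a x^2$ the Hessian is the constant $a$, and one gradient descent step maps $x$ to $(1 - \eta a)\,x$, so the iterates shrink when $a < 2/\eta$ and grow when $a > 2/\eta$.

```python
def gd_on_quadratic(curvature, step_size, x0=1.0, num_steps=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2.

    Returns the list of loss values recorded before each step.
    Illustrative helper (hypothetical name), not taken from the paper's code.
    """
    x = x0
    losses = []
    for _ in range(num_steps):
        losses.append(0.5 * curvature * x * x)
        grad = curvature * x       # f'(x) = curvature * x
        x = x - step_size * grad   # equivalent to x <- (1 - step_size * curvature) * x
    return losses


step_size = 0.1                    # assumed value, chosen for illustration
threshold = 2.0 / step_size        # the 2 / (step size) stability threshold

# Try curvatures well below, just below, and just above the threshold.
for curvature in (0.5 * threshold, 0.95 * threshold, 1.05 * threshold):
    losses = gd_on_quadratic(curvature, step_size)
    trend = "decreases" if losses[-1] < losses[0] else "grows"
    print(f"curvature={curvature:.2f}  threshold={threshold:.2f}  loss {trend}")
```

On a quadratic, curvature above the threshold simply makes the loss blow up. The abstract's point is that neural network training behaves differently: the maximum Hessian eigenvalue hovers just above this threshold while the loss still decreases over long timescales.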
