LGOCMLJun 8, 2022

Beyond the Edge of Stability via Two-step Gradient Updates

arXiv:2206.04172v326 citationsh-index: 15
Originality Incremental advance
AI Analysis

This addresses a fundamental theoretical gap in understanding unstable convergence behavior in overparameterized machine learning models, though it focuses on simplified settings rather than practical applications.

The paper investigates why Gradient Descent can converge even with large learning rates beyond the Edge of Stability, where theoretical guarantees break down. It provides a theoretical condition involving third-order derivatives that ensures convergence to fixed points in simple learning problems, and demonstrates period-2 orbits in high-dimensional matrix factorization.

Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a `bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called ``Edge of Stability'' (EoS), where the step-size crosses the admissibility threshold inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability and oscillatory behavior. The incipient theoretical analysis of this phenomena has mainly focused in the overparametrised regime, where the effect of choosing a large learning rate may be associated to a `Sharpness-Minimisation' implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple, yet representative, learning problems, via analysis of two-step gradient updates. Specifically, we characterize a local condition involving third-order derivatives that guarantees existence and convergence to fixed points of the two-step updates, and leverage such property in a teacher-student setting, under population loss. Finally, starting from Matrix Factorization, we provide observations of period-2 orbit of GD in high-dimensional settings with intuition of its dynamics, along with exploration into more general settings.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes