LGFeb 20, 2024

On the Convergence of Gradient Descent for Large Learning Rates

arXiv:2402.13108v33 citationsh-index: 13
Originality Highly original
AI Analysis

This addresses a fundamental gap in optimization theory for machine learning practitioners, offering rigorous insights into convergence limits in practical scenarios.

The paper tackles the problem of whether gradient descent with a fixed step size can converge from any initialization when the step size is too large, proving impossibility results that show a sharp transition in convergence behavior as step size exceeds a critical value. It provides proofs for linear neural networks with squared loss and extends to more general losses without strong assumptions, validated through experiments with non-linear networks.

A vast literature on convergence guarantees for gradient descent and derived methods exists at the moment. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge starting from any initialization? We provide fundamental impossibility results showing that convergence becomes impossible no matter the initialization if the step size gets too big. Looking at the asymptotic value of the gradient norm along the optimization trajectory, we see that there is a sharp transition as the step size crosses a critical value. This has been observed by practitioners, yet the true mechanisms through which this happens remain unclear beyond heuristics. Using results from dynamical systems theory, we provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient. We validate our findings through experiments with non-linear networks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes