Adaptive Preconditioners Trigger Loss Spikes in Adam
This work addresses training instability in deep learning for practitioners who use Adam. It offers a specific explanation for loss spikes, though the contribution is incremental because it builds on prior work on optimizer behavior.
The paper finds that Adam's adaptive preconditioners, rather than loss-landscape sharpness alone, can trigger training loss spikes. In a critical regime where squared gradients become much smaller than the second-order moment estimates, those estimates decay exponentially at rate $β_2$ and respond sluggishly to current gradients. This can push the maximum eigenvalue of the preconditioned Hessian above the stability threshold $2/η$ for a sustained period, causing instability. The mechanism is verified experimentally on fully connected networks, convolutional networks, and Transformers.
Loss spikes emerge commonly during training across neural networks of varying architectures and scales when using the Adam optimizer. In this work, we investigate the underlying mechanism responsible for Adam spikes. While previous explanations attribute these phenomena to the lower-loss-as-sharper characteristics of the loss landscape, our analysis reveals that Adam's adaptive preconditioners themselves can trigger spikes. Specifically, we identify a critical regime where squared gradients become substantially smaller than the second-order moment estimates, causing the latter to undergo a $β_2$-exponential decay and to respond sluggishly to current gradient information. This mechanism can push the maximum eigenvalue of the preconditioned Hessian beyond the classical stability threshold $2/η$ for a sustained period, inducing instability. This instability further leads to an alignment between the gradient and the maximum eigendirection, and a loss spike occurs precisely when the gradient-directional curvature exceeds $2/η$. We verify this mechanism through extensive experiments on fully connected networks, convolutional networks, and Transformer architectures.
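The critical regime described above can be illustrated with a minimal one-dimensional sketch. It is a toy under stated assumptions, not the paper's experimental setup: the curvature `h` is held fixed, bias correction and first-moment momentum are ignored, and the function names (`second_moment_trace`, `preconditioned_curvature`, `first_unstable_step`) are hypothetical. It uses the standard Adam second-moment update $v_t = β_2 v_{t-1} + (1-β_2) g_t^2$ and treats $h / (\sqrt{v_t} + ε)$ as a scalar stand-in for the maximum eigenvalue of the preconditioned Hessian.

```python
import math


def second_moment_trace(v0, grads, beta2=0.999):
    """Adam's second-moment EMA: v_t = beta2 * v_{t-1} + (1 - beta2) * g_t**2."""
    v, trace = v0, []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        trace.append(v)
    return trace


def preconditioned_curvature(h, v, eps=1e-8):
    """Scalar stand-in for the max eigenvalue of the preconditioned Hessian.

    Assumption: a 1-D problem with fixed curvature h, so the preconditioned
    curvature is simply h / (sqrt(v) + eps).
    """
    return h / (math.sqrt(v) + eps)


def first_unstable_step(h, v_trace, lr, eps=1e-8):
    """Return the first step (1-indexed) where the curvature exceeds 2 / lr."""
    threshold = 2.0 / lr
    for t, v in enumerate(v_trace, start=1):
        if preconditioned_curvature(h, v, eps) > threshold:
            return t
    return None


if __name__ == "__main__":
    # Illustrative values only: g**2 = 1e-8 is far smaller than v0 = 1.0,
    # so v_t is dominated by the beta2 * v_{t-1} term and decays geometrically.
    beta2, lr, h = 0.999, 0.01, 100.0
    trace = second_moment_trace(1.0, [1e-4] * 3000, beta2=beta2)
    print("v_2 / v_1 =", trace[1] / trace[0])
    print("initial curvature =", preconditioned_curvature(h, 1.0), "threshold =", 2.0 / lr)
    print("first step above threshold =", first_unstable_step(h, trace, lr))
```

In this toy setting the curvature starts below $2/η$, and the crossing comes entirely from the decay of $v_t$, since `h` never changes. That is the sense in which the preconditioner, not the landscape, triggers the instability.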