LG OC · Jan 30

Adaptive Momentum and Nonlinear Damping for Neural Network Training

Aikaterini Karoni, Rajit Rajpal, Benedict Leimkuhler, Gabriel Stoltz

arXiv:2602.00334v1 · 3.83 citations · h-index: 4

Originality: Incremental advance

AI Analysis

The work addresses optimization stability issues for practitioners training large models such as ViT, BERT, and GPT2, but it is incremental, building on existing methods such as mSGD and Adam.

The paper tackles the challenge of maintaining both stability and convergence speed in large-scale neural network optimization by proposing a continuous-time scheme with per-parameter adaptive momentum coefficients and cubic damping. Empirically, the proposed methods match or outperform Adam when training ViT, BERT, and GPT2, and the authors establish exponential convergence of the schemes theoretically.

We propose a continuous-time scheme for large-scale optimization that introduces individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This approach automatically adjusts to local landscape curvature to maintain stability without sacrificing convergence speed. We demonstrate that our adaptive friction can be related to cubic damping, a suppression mechanism from structural dynamics. Furthermore, we introduce two specific optimization schemes by augmenting the continuous dynamics of mSGD and Adam with a cubic damping term. Empirically, our methods demonstrate robustness and match or outperform Adam on training ViT, BERT, and GPT2 tasks where mSGD typically struggles. We further provide theoretical results establishing the exponential convergence of the proposed schemes.
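
The abstract does not state the equations, so the following is only a minimal sketch under assumed dynamics, not the authors' scheme. It assumes momentum dynamics q' = p, p' = -∇f(q) - (γ + c p_i²) p_i per parameter, where the friction coefficient grows with that parameter's kinetic energy (proportional to p_i²), so the friction force contains the cubic term c p_i³. The function name `cubic_damped_momentum`, the constants `gamma` and `c`, the semi-implicit Euler discretization, and the quadratic test objective are all illustrative assumptions.

```python
def cubic_damped_momentum(grad, q0, lr=0.05, gamma=1.0, c=1.0, steps=500):
    """Semi-implicit Euler for the ASSUMED dynamics
    q' = p,  p' = -grad(q) - gamma * p - c * p**3  (elementwise)."""
    q = list(q0)
    p = [0.0] * len(q)
    for _ in range(steps):
        g = grad(q)
        # Per-parameter friction coefficient: a linear base plus a term
        # proportional to p_i^2, which makes the friction force cubic in p_i.
        p = [pi + lr * (-gi - (gamma + c * pi * pi) * pi)
             for pi, gi in zip(p, g)]
        # Position update uses the freshly updated momentum.
        q = [qi + lr * pi for qi, pi in zip(q, p)]
    return q


def quad_grad(q):
    # Gradient of the hypothetical ill-conditioned quadratic
    # f(q) = 0.5 * (q[0]**2 + 25 * q[1]**2).
    return [q[0], 25.0 * q[1]]


q_final = cubic_damped_momentum(quad_grad, [1.0, 1.0])
print(q_final)
```

In this sketch the cubic term acts mainly while a coordinate's momentum is large, as in the stiff direction early on, and becomes negligible near the minimum, where the linear friction `gamma` governs the decay.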
