LG · OC · Jan 30

Adaptive Momentum and Nonlinear Damping for Neural Network Training

arXiv:2602.00334v1 · 3 citations · h-index: 4
Originality: Incremental advance
AI Analysis

The paper addresses optimization stability issues for practitioners training large models such as ViT, BERT, and GPT2, but it is an incremental advance that builds on existing methods such as mSGD and Adam.

The paper tackles the challenge of maintaining both stability and convergence speed in large-scale neural network optimization by proposing a continuous-time scheme with per-parameter adaptive momentum coefficients and cubic damping. Empirically, the proposed methods match or outperform Adam when training ViT, BERT, and GPT2, and the authors establish exponential convergence of the schemes theoretically.

We propose a continuous-time scheme for large-scale optimization that introduces individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This approach automatically adjusts to local landscape curvature to maintain stability without sacrificing convergence speed. We demonstrate that our adaptive friction can be related to cubic damping, a suppression mechanism from structural dynamics. Furthermore, we introduce two specific optimization schemes by augmenting the continuous dynamics of mSGD and Adam with a cubic damping term. Empirically, our methods demonstrate robustness and match or outperform Adam on training ViT, BERT, and GPT2 tasks where mSGD typically struggles. We further provide theoretical results establishing the exponential convergence of the proposed schemes.
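
The abstract does not state the equations, so the following is only a minimal sketch of how a kinetic-energy-dependent friction coefficient gives rise to a cubic damping term. It assumes a heavy-ball ODE, x' = v, v' = -(gamma0 + c*v^2)*v - grad f(x), applied elementwise and discretized with a semi-implicit Euler step. The function name `cubic_damped_momentum`, the parameters `gamma0` and `c`, and the quadratic test problem are all hypothetical choices for illustration, not the authors' scheme.

```python
import numpy as np

def cubic_damped_momentum(grad, x0, lr=0.01, gamma0=1.0, c=0.5, steps=5000):
    """Illustrative momentum dynamics with per-parameter cubic damping.

    Assumed (not from the paper) continuous-time model, elementwise:
        x' = v
        v' = -(gamma0 + c * v**2) * v - grad(x)
    The friction coefficient gamma0 + c * v_i**2 grows with the kinetic
    energy v_i**2 / 2 of each parameter, so the friction force is
    gamma0 * v_i + c * v_i**3, i.e. linear plus cubic damping.
    """
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        friction = gamma0 + c * v**2            # one coefficient per parameter
        v = v + lr * (-friction * v - grad(x))  # update velocity first
        x = x + lr * v                          # then position (semi-implicit Euler)
    return x

# Hypothetical test problem: f(x) = 0.5 * sum(a_i * x_i**2) with uneven curvature.
a = np.array([1.0, 10.0])
x_final = cubic_damped_momentum(lambda x: a * x, x0=[1.0, 1.0])
print(np.linalg.norm(x_final))
```

Setting `c=0` in this sketch recovers plain heavy-ball momentum with constant friction `gamma0`; the cubic term only matters while a parameter is moving fast, which is the intuition behind damping that adapts to local curvature.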

