LG · Apr 19

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muon

arXiv:2604.17423 · 18.12 citations · h-index: 8
Predicted impact top 81% in LG · last 90 days · Originality: Highly original
AI Analysis

Provides a theoretical foundation for a broad class of adaptive optimizers used in deep learning, unifying their analysis and enabling heterogeneous preconditioning across variable groups.

The paper proposes a unified convergence theory for adaptive first-order methods (AdaGrad, AdaNorm, Shampoo, Muon) in nonconvex optimization, proving global convergence rates under mild assumptions without bounded gradients or small stepsizes.

A unified framework for first-order optimization algorithms for nonconvex unconstrained optimization is proposed that uses adaptively preconditioned gradients and includes popular methods such as full and diagonal AdaGrad, AdaNorm, as well as adaptive variants of Shampoo and Muon. This framework also allows combining heterogeneous geometries across different groups of variables while preserving a unified convergence analysis. A fully stochastic global rate-of-convergence analysis is conducted for all methods in the framework, with and without two types of momentum, using reasonable assumptions on the variance of the gradient oracle and without assuming bounded stochastic gradients or a small enough stepsize.
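To make "adaptively preconditioned gradients" concrete, here is a minimal Python sketch of the two simplest methods named in the abstract. It is illustrative only and is not taken from the paper. It assumes AdaNorm means the scalar, norm-based variant of AdaGrad, and it uses an exact gradient where the paper analyses a stochastic oracle. The test function, the stepsize `eta`, the initial accumulator `eps`, and all function names are hypothetical choices for this example.

```python
import math

def f(x):
    """Hypothetical nonconvex test function: sum_i x_i^2 / (1 + x_i^2)."""
    return sum(xi * xi / (1.0 + xi * xi) for xi in x)

def grad(x):
    """Exact gradient of f (the paper's analysis allows a stochastic oracle instead)."""
    return [2.0 * xi / (1.0 + xi * xi) ** 2 for xi in x]

def grad_norm(x):
    """Euclidean norm of the gradient at x."""
    return math.sqrt(sum(gi * gi for gi in grad(x)))

def adanorm(x0, eta=0.5, steps=500, eps=1e-12):
    """Scalar preconditioner: one accumulator of squared gradient norms."""
    x = list(x0)
    v = eps  # small positive start so the first division is well defined
    for _ in range(steps):
        g = grad(x)
        v += sum(gi * gi for gi in g)
        scale = eta / math.sqrt(v)
        x = [xi - scale * gi for xi, gi in zip(x, g)]
    return x

def diagonal_adagrad(x0, eta=0.5, steps=500, eps=1e-12):
    """Diagonal preconditioner: one accumulator per coordinate."""
    x = list(x0)
    v = [eps] * len(x)
    for _ in range(steps):
        g = grad(x)
        v = [vi + gi * gi for vi, gi in zip(v, g)]
        x = [xi - eta * gi / math.sqrt(vi) for xi, gi, vi in zip(x, g, v)]
    return x

x0 = [2.0, -1.5]
x_norm = adanorm(x0)
x_diag = diagonal_adagrad(x0)
print("AdaNorm          :", f(x_norm), grad_norm(x_norm))
print("diagonal AdaGrad :", f(x_diag), grad_norm(x_diag))
```

Both routines share the same loop and differ only in how the accumulator `v` is shaped: a single scalar for AdaNorm, one entry per coordinate for diagonal AdaGrad. Neither requires a stepsize tuned to the smoothness constant, which is the setting the abstract describes.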

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
