LG · OC · Feb 10

Adaptive Optimization via Momentum on Variance-Normalized Gradients

arXiv:2602.10204v1
Originality: Incremental advance
AI Analysis

This work addresses optimization instability for deep learning practitioners and offers an incremental improvement over existing Adam-type methods.

The paper tackles instability in Adam-style optimizers by introducing MVN-Grad, which normalizes each gradient by a variance estimate and applies momentum afterward, decoupling stale momentum from the stochastic normalizer. On CIFAR-100 and GPT-style benchmarks it matches or outperforms Adam, AdaBelief, and LaProp, with smoother training and better generalization.

We introduce MVN-Grad (Momentum on Variance-Normalized Gradients), an Adam-style optimizer that improves stability and performance by combining two complementary ideas: variance-based normalization and momentum applied after normalization. MVN-Grad scales each coordinate by an exponential moving average of gradient uncertainty and applies momentum to the resulting normalized gradients, eliminating the cross-time coupling between stale momentum and a stochastic normalizer present in standard Adam-type updates. We prove that this decoupling yields strictly smaller one-step conditional update variance than momentum-then-normalize variance methods under standard noise assumptions, and that MVN-Grad is robust to outliers: it has a uniformly bounded response to single gradient spikes. In low-variance regimes, we further show variance normalization avoids sign-type collapse associated with second-moment scaling and can yield accelerated convergence. Across CIFAR-100 image classification and GPT-style language modeling benchmarks, MVN-Grad matches or outperforms Adam, AdaBelief, and LaProp, delivering smoother training and improved generalization with no added overhead.
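
The ordering described in the abstract (normalize by a variance estimate first, then apply momentum) can be illustrated with a rough NumPy sketch. This is not the paper's update rule, which the abstract does not state. The sketch assumes an AdaBelief-style variance estimate (an exponential moving average of squared deviations from a running gradient mean), LaProp-style momentum on the normalized gradient, and Adam-style bias correction. The names `init_state`, `mvn_grad_sketch_step`, and `run_demo`, the extra `g_mean` centering buffer, and all hyperparameter defaults are hypothetical choices made for illustration.

```python
import numpy as np


def init_state(shape):
    """Per-parameter optimizer state for the sketch (all buffers hypothetical)."""
    return {
        "t": 0,
        "g_mean": np.zeros(shape),  # running gradient mean, used only for centering
        "var": np.zeros(shape),     # EMA of squared deviation from g_mean
        "mom": np.zeros(shape),     # momentum over normalized gradients
    }


def mvn_grad_sketch_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative normalize-then-momentum step.

    Assumption: the variance estimate follows AdaBelief's centered form.
    The paper's actual estimator and bias correction may differ.
    """
    state["t"] += 1
    t = state["t"]

    # Running mean of raw gradients (assumed centering term).
    state["g_mean"] = beta1 * state["g_mean"] + (1 - beta1) * grad

    # EMA of squared deviation: a per-coordinate gradient-uncertainty estimate.
    deviation = grad - state["g_mean"]
    state["var"] = beta2 * state["var"] + (1 - beta2) * deviation ** 2
    var_hat = state["var"] / (1 - beta2 ** t)

    # Normalize the current gradient first...
    normalized = grad / (np.sqrt(var_hat) + eps)

    # ...then apply momentum to the normalized gradient, so old momentum
    # is never divided by the current stochastic normalizer.
    state["mom"] = beta1 * state["mom"] + (1 - beta1) * normalized
    mom_hat = state["mom"] / (1 - beta1 ** t)

    return param - lr * mom_hat


def run_demo(steps=500, dim=10, noise_std=1.0, seed=0):
    """Minimize 0.5 * ||x||^2 from noisy gradients; return (initial, final) norms."""
    rng = np.random.default_rng(seed)
    x = np.full(dim, 5.0)
    state = init_state(x.shape)
    initial_norm = float(np.linalg.norm(x))
    for _ in range(steps):
        grad = x + noise_std * rng.standard_normal(dim)  # noisy gradient of the quadratic
        x = mvn_grad_sketch_step(x, grad, state, lr=0.05)
    return initial_norm, float(np.linalg.norm(x))
```

Calling `run_demo()` returns the parameter norm before and after training on a noisy quadratic. Note that this sketch keeps three state buffers per parameter, which is a consequence of the assumed centering term and says nothing about the memory footprint of the method in the paper.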

