LG AI | Oct 17, 2025

SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients

Dominik Kallusky, Vinay Rao, Vishal Nandavanam, Hao-Jun Michael Shi

Baidu

arXiv:2510.15830v1 | 6 citations | h-index: 6

Originality: Incremental advance

AI Analysis

SNOO provides a practical enhancement for training large models with minimal overhead, but the advance is incremental because it builds on the existing Lookahead and DiLoCo methods.

The paper addresses inefficient optimization of large language models. It shows that applying Nesterov momentum to pseudo-gradients in a Lookahead framework improves training, with compute factor gains of 1.5-2.5x in non-distributed settings at scales up to 1e23 training FLOPs.

The rapid development of large language models (LLMs) has driven the demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner optimizer steps on the fast weights produce a trajectory whose net displacement, the pseudo-gradient, is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the pseudo-gradient averaged over multiple workers, and is claimed to outperform AdamW even in a non-distributed setup. In this paper, we empirically show that DiLoCo's surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-$K$ Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute factor gains of 1.5-2.5$\times$ in a non-distributed setting up to a scale of 1e23 training FLOPs, with improvements that increase with model size. Because of its minimal compute and memory overhead and its compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.
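The two-loop structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy quadratic loss, plain gradient descent as a stand-in inner optimizer (the paper names AdamW and Muon), a pseudo-gradient defined as slow weights minus fast weights, and the Nesterov update form commonly used for SGD with momentum. The function names `snoo_outer_step`, `toy_grad`, and `train`, and all hyperparameter values, are hypothetical.

```python
def snoo_outer_step(slow, fast, momentum_buf, outer_lr=0.7, mu=0.9):
    """One outer update on the slow weights (illustrative sketch only).

    Assumes pseudo-gradient = slow - fast and a Nesterov step of the form
    buf = mu * buf + g; update = g + mu * buf.
    """
    new_slow, new_buf = [], []
    for s, f, b in zip(slow, fast, momentum_buf):
        g = s - f              # pseudo-gradient from K inner steps
        b = mu * b + g         # momentum buffer update
        step = g + mu * b      # Nesterov look-ahead direction
        new_slow.append(s - outer_lr * step)
        new_buf.append(b)
    return new_slow, new_buf


def toy_grad(w):
    # Gradient of the toy loss 0.5 * sum((w_i - 3)^2); minimum at w_i = 3.
    return [wi - 3.0 for wi in w]


def train(num_outer=100, k=5, inner_lr=0.05, outer_lr=0.7, mu=0.9):
    slow = [0.0, 10.0]                 # slow weights
    buf = [0.0 for _ in slow]          # outer momentum state
    for _ in range(num_outer):
        fast = list(slow)              # fast weights start at the slow weights
        for _ in range(k):             # K inner steps (plain SGD as a stand-in)
            g = toy_grad(fast)
            fast = [w - inner_lr * gi for w, gi in zip(fast, g)]
        slow, buf = snoo_outer_step(slow, fast, buf, outer_lr, mu)
    return slow


if __name__ == "__main__":
    print(train())
```

In this sketch the only extra state beyond the inner optimizer is one copy of the slow weights and one momentum buffer, touched once every `k` steps, which is consistent with the abstract's claim of minimal compute and memory overhead.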
