LG | AI | Oct 10, 2025

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

arXiv:2510.09378v1 | 12 citations | h-index: 96
Originality: Incremental advance
AI Analysis

This work targets faster LLM pretraining and shows that the approximations used by current second-order-inspired optimizers may be suboptimal. It is rated incremental because it builds on existing second-order optimization methods.

The study investigated the performance loss from approximations in second-order optimization for LLMs by applying full Gauss-Newton preconditioning to transformer models up to 150M parameters, achieving a 5.4x reduction in training iterations compared to strong baselines.

Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
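To make the idea of Gauss-Newton preconditioning concrete, here is a minimal sketch on a toy problem. It is not the paper's setup: the two-parameter model, the squared loss, the damping value, the backtracking rule, and all function names are assumptions chosen for illustration. Each parameter stands in for one "layer", so the layerwise variant simply drops the off-diagonal (cross-layer) entry of the GN matrix.

```python
import math

def model(w, x):
    # Hypothetical toy model: w[0] plays the first "layer", w[1] the second.
    return w[1] * math.tanh(w[0] * x)

def jacobian(w, x):
    # Derivative of the model output with respect to each parameter.
    t = math.tanh(w[0] * x)
    return [w[1] * (1.0 - t * t) * x, t]

def loss(w, data):
    # Mean squared error with the conventional factor 1/2.
    return sum(0.5 * (model(w, x) - y) ** 2 for x, y in data) / len(data)

def grad_and_gn(w, data):
    # Gradient g = mean(J^T r) and Gauss-Newton matrix G = mean(J^T J).
    # For squared loss the Hessian w.r.t. the model output is 1, so G = J^T J.
    n = len(data)
    g = [0.0, 0.0]
    G = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in data:
        J = jacobian(w, x)
        r = model(w, x) - y
        for i in range(2):
            g[i] += J[i] * r / n
            for j in range(2):
                G[i][j] += J[i] * J[j] / n
    return g, G

def gn_direction(g, G, damping=1e-3, layerwise=False):
    # Solve (G + damping * I) d = -g for the 2x2 case in closed form.
    a, b, d = G[0][0] + damping, G[0][1], G[1][1] + damping
    if layerwise:
        b = 0.0  # drop cross-layer curvature, keep per-layer blocks
    det = a * d - b * b
    return [-(d * g[0] - b * g[1]) / det, -(a * g[1] - b * g[0]) / det]

def gn_step(w, data, layerwise=False):
    # One damped GN step with simple backtracking so the loss never increases.
    g, G = grad_and_gn(w, data)
    d = gn_direction(g, G, layerwise=layerwise)
    base = loss(w, data)
    step = 1.0
    for _ in range(30):
        trial = [w[0] + step * d[0], w[1] + step * d[1]]
        if loss(trial, data) < base:
            return trial
        step *= 0.5
    return w  # no improving step found; keep the current parameters

def train(w, data, steps=20, layerwise=False):
    for _ in range(steps):
        w = gn_step(w, data, layerwise=layerwise)
    return w

# Synthetic data generated from the (assumed) true parameters [1.5, 2.0].
data = [(x, 2.0 * math.tanh(1.5 * x)) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
w0 = [0.5, 0.5]

if __name__ == "__main__":
    print("initial loss:     ", loss(w0, data))
    print("full GN loss:     ", loss(train(w0, data), data))
    print("layerwise GN loss:", loss(train(w0, data, layerwise=True), data))
```

For a transformer the GN matrix is far too large to form explicitly, so a sketch like this only conveys the structure of the update (gradient preconditioned by a damped curvature matrix) and the difference between keeping and discarding cross-layer terms.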
