ML · LG · Sep 3, 2025

Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization

arXiv:2509.03378v4 · 8 citations · h-index: 12
Originality: Highly original
AI Analysis

This work addresses optimization bottlenecks for neural network training by providing a more efficient and theoretically grounded alternative to existing structured second-moment methods.

The paper addresses the performance and memory-overhead limitations of the Shampoo and SOAP optimizers by recasting their estimation procedures as covariance estimation under Kullback-Leibler divergence minimization. The resulting methods, KL-Shampoo and KL-SOAP, match or exceed Shampoo and SOAP in neural network pre-training at SOAP-level per-iteration runtime, and KL-Shampoo in particular does not rely on Adam, which eliminates the memory overhead Adam introduces.

Shampoo and its efficient variant, SOAP, employ structured second-moment estimations and have shown strong performance for training neural networks (NNs). In practice, however, Shampoo typically requires step-size grafting with Adam to be competitive, and SOAP mitigates this by applying Adam in Shampoo's eigenbasis -- at the cost of additional memory overhead from Adam in both methods. Prior analyses have largely relied on the Frobenius norm to motivate these estimation schemes. We instead recast their estimation procedures as covariance estimation under Kullback-Leibler (KL) divergence minimization, revealing a previously overlooked theoretical limitation and motivating principled redesigns. Building on this perspective, we develop $\textbf{KL-Shampoo}$ and $\textbf{KL-SOAP}$, practical schemes that match or exceed the performance of Shampoo and SOAP in NN pre-training while achieving SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to attain competitive performance, eliminating the memory overhead introduced by Adam. Across our experiments, KL-Shampoo consistently outperforms SOAP, Shampoo, and even KL-SOAP, establishing the KL-based approach as a compelling foundation for designing structured methods in NN optimization.
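
To make the "covariance estimation under KL divergence" viewpoint concrete, here is a minimal NumPy sketch. It is not the paper's KL-Shampoo or KL-SOAP update. It only evaluates the standard closed-form KL divergence between two zero-mean Gaussians, and uses it to score a Kronecker-structured second-moment estimate against the full empirical second moment of flattened gradients. The function names, the random stand-in "gradients", the damping value `eps`, and the trace normalization of the Kronecker factors are all illustrative assumptions.

```python
import numpy as np


def gaussian_kl(cov_p, cov_q):
    """KL( N(0, cov_p) || N(0, cov_q) ) for symmetric positive-definite matrices."""
    d = cov_p.shape[0]
    # tr(cov_q^{-1} cov_p), computed with a linear solve instead of an explicit inverse
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (trace_term - d + logdet_q - logdet_p)


def kronecker_factors(grads, eps=1e-6):
    """Average G G^T and G^T G over gradient matrices (hypothetical helper, with damping)."""
    m, n = grads[0].shape
    left = sum(g @ g.T for g in grads) / len(grads) + eps * np.eye(m)
    right = sum(g.T @ g for g in grads) / len(grads) + eps * np.eye(n)
    return left, right


rng = np.random.default_rng(0)
m, n = 3, 2
# Random matrices standing in for per-step gradients of an m x n weight (assumption).
grads = [rng.standard_normal((m, n)) for _ in range(200)]

# Full (unstructured) empirical second moment of the row-major flattened gradients.
flat = np.stack([g.reshape(-1) for g in grads])
empirical = flat.T @ flat / len(grads) + 1e-6 * np.eye(m * n)

# Kronecker-structured estimate. If the true covariance is A (x) B, then
# E[G G^T] = tr(B) A and E[G^T G] = tr(A) B, so dividing by tr(left) removes
# the duplicated scale. This normalization is an illustrative choice.
left, right = kronecker_factors(grads)
structured = np.kron(left, right) / np.trace(left)

kl_structured = gaussian_kl(empirical, structured)
kl_identity = gaussian_kl(empirical, np.eye(m * n))
print("KL(empirical || Kronecker estimate):", kl_structured)
print("KL(empirical || identity):         ", kl_identity)
```

The divergence is zero only when the two covariances coincide, so it gives a single scalar for how much a structured estimate loses relative to the full second moment. Unlike a Frobenius-norm criterion, it is defined only for positive-definite matrices.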

