LG · Oct 15, 2025

Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise

Harvard · Microsoft
arXiv:2510.13680v1 · 3 citations · h-index: 96
Originality: Synthesis-oriented
AI Analysis

This work provides theoretical insights into optimizer performance for deep learning practitioners, but it is incremental as it compares existing methods without introducing new algorithms.

The paper compares Adam and Gauss-Newton (GN) diagonal preconditioners in terms of basis alignment and SGD noise. It shows that there are instances where Adam outperforms both GN$^{-1}$ and GN$^{-1/2}$ in full-batch settings, and that Adam behaves similarly to GN$^{-1/2}$ in the stochastic regime for linear regression under a Gaussian data assumption.

Diagonal preconditioners are computationally feasible approximations to second-order optimizers, which have shown significant promise in accelerating the training of deep learning models. Two predominant approaches are based on Adam and Gauss-Newton (GN) methods: the former leverages statistics of current gradients and is the de facto optimizer for neural networks, while the latter uses the diagonal elements of the Gauss-Newton matrix and underpins some recent diagonal optimizers such as Sophia. In this work, we compare these two diagonal preconditioning methods through the lens of two key factors: the choice of basis in the preconditioner, and the impact of gradient noise from mini-batching. To gain insights, we analyze these optimizers on quadratic objectives and logistic regression under all four quadrants. We show that regardless of the basis, there exist instances where Adam outperforms both GN$^{-1}$ and GN$^{-1/2}$ in full-batch settings. Conversely, in the stochastic regime, Adam behaves similarly to GN$^{-1/2}$ for linear regression under a Gaussian data assumption. These theoretical results are supported by empirical studies on both convex and non-convex objectives.
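To make the three update rules concrete, here is a minimal NumPy sketch on linear regression with squared loss $L(w) = \|Xw - y\|^2 / (2n)$, where the Gauss-Newton matrix is $X^\top X / n$. It is an illustration under stated assumptions, not the paper's implementation: the function names (`grad_and_gn_diag`, `gn_step`, `adam_step`) are hypothetical, the preconditioners are applied in the identity (coordinate) basis only, and the Adam step omits momentum ($\beta_1 = 0$) so that only the second-moment preconditioner differs from the GN variants.

```python
import numpy as np

def grad_and_gn_diag(w, X, y):
    """Gradient and Gauss-Newton diagonal for L(w) = ||Xw - y||^2 / (2n)."""
    n = X.shape[0]
    residual = X @ w - y
    grad = X.T @ residual / n
    gn_diag = np.sum(X * X, axis=0) / n  # diag(X^T X / n)
    return grad, gn_diag

def gn_step(w, grad, gn_diag, lr, power=1.0, eps=1e-12):
    """Diagonal GN^{-power} step: power=1.0 gives GN^{-1}, power=0.5 gives GN^{-1/2}."""
    return w - lr * grad / (gn_diag ** power + eps)

def adam_step(w, grad, v, t, lr, beta2=0.999, eps=1e-8):
    """Adam step without momentum (beta1 = 0, an assumption of this sketch).

    v is the running average of squared gradients; t is the 1-based step count
    used for bias correction.
    """
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * grad / (np.sqrt(v_hat) + eps), v
```

In the full-batch setting, pass the whole dataset as `X, y`; for the stochastic regime, pass a sampled mini-batch to `grad_and_gn_diag` at each step and carry `v` across calls to `adam_step`. Changing the basis would amount to rotating `w` and `grad` by an orthogonal matrix before applying the same elementwise scaling, which this sketch leaves out.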

