ML LG May 5

Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

arXiv:2605.04269 82.1 h-index: 1
AI Analysis

This work offers the first rigorous theoretical characterization of when adaptive methods like Adam are beneficial or harmful in nonstationary settings, addressing a key gap for practitioners dealing with distribution shift.

The paper provides a theoretical analysis of Adam in nonstationary optimization, deriving finite-time bounds that reveal a noise-drift tradeoff: Adam outperforms SGD in noise-dominated regimes but can be worse in drift-dominated regimes, explaining its empirical instability under distribution shift.

We provide a theoretical analysis of Adam under non-stationary stochastic objectives, separating two regimes: Euclidean tracking under adaptive strong monotonicity of the Adam-preconditioned mean-gradient operator, and high-probability projected stationarity guarantees under general $L$-smooth objectives. In the tracking regime, we derive finite-time expected and high-probability bounds that decompose sharply into four components: initialization, objective drift, a first-moment tracking error governed by $β_1$, and a preconditioner perturbation governed by $β_2$. We characterize the burn-in time to reach Adam's irreducible tracking floor under constant and step-decay schedules. We also prove a high-probability bound on the average projected stationarity gap for Adam under distribution shift. Across both analyses, our bounds reveal a noise-drift tradeoff: in noise-dominated regimes, first-moment averaging and adaptive preconditioning can improve the high-probability error, whereas in drift-dominated regimes, stale first-moment information and preconditioner perturbations can compound the cost of nonstationarity, allowing vanilla SGD to achieve a smaller tracking floor. Our explicit $(β_1,β_2,ε)$-dependent bounds delineate when adaptive step-sizing is beneficial versus harmful, and provide a theoretical mechanism for Adam's empirical instability and stabilization under distribution shift.
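
The tracking setting in the abstract can be made concrete with a toy simulation. The sketch below is illustrative only and is not the paper's experimental setup. It assumes a one-dimensional quadratic $f_t(x) = \tfrac12 (x - θ_t)^2$ whose minimizer $θ_t$ moves by a fixed amount per step, with Gaussian gradient noise. The function name `track`, the drift model, and all default hyperparameters are hypothetical choices made for the example. The Adam branch is the standard bias-corrected update with parameters $(β_1, β_2, ε)$.

```python
import math
import random


def track(optimizer, drift, noise, steps=2000, lr=0.05,
          beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Mean squared tracking error on f_t(x) = 0.5 * (x - theta_t)**2.

    theta_t moves by `drift` each step (assumed drift model) and the
    gradient carries Gaussian noise with standard deviation `noise`.
    The error is averaged over the second half of the run so that the
    initial burn-in phase is excluded.
    """
    rng = random.Random(seed)
    x, theta = 0.0, 0.0
    m, v = 0.0, 0.0  # Adam first- and second-moment estimates
    errors = []
    for t in range(1, steps + 1):
        theta += drift                                  # objective drifts
        g = (x - theta) + noise * rng.gauss(0.0, 1.0)   # noisy gradient
        if optimizer == "sgd":
            x -= lr * g
        elif optimizer == "adam":
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g * g
            m_hat = m / (1.0 - beta1 ** t)              # bias correction
            v_hat = v / (1.0 - beta2 ** t)
            x -= lr * m_hat / (math.sqrt(v_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer!r}")
        errors.append((x - theta) ** 2)
    tail = errors[steps // 2:]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    # Two hypothetical regimes: mostly noise, and mostly drift.
    for label, drift, noise in [("noise-dominated", 0.0, 1.0),
                                ("drift-dominated", 0.01, 0.0)]:
        for opt in ("sgd", "adam"):
            print(label, opt, track(opt, drift, noise))
```

Varying `drift`, `noise`, `beta1`, and `beta2` gives a rough feel for how the two methods' steady-state errors respond to each regime. Which method comes out ahead in this toy depends on those settings, so it should not be read as a check of the paper's bounds.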

