On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization
For practitioners using momentum in nonstationary optimization (e.g., online learning, reinforcement learning), this paper provides theoretical justification for empirical instability and shows vanilla SGD can be superior.
This paper proves that momentum SGD (Polyak Heavy-Ball and Nesterov) is provably worse than vanilla SGD for tracking time-varying optima under distribution shift, with a drift-amplification penalty that diverges as momentum parameter approaches 1. Minimax lower bounds confirm this is an information-theoretic barrier, not an analytical artifact.
In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter $β$ approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum is unavoidably worse because stale-gradient averaging forces systematic lag. Our results provide theoretical grounding for the empirical instability of momentum in dynamic settings and precisely delineate regime boundaries where vanilla SGD provably outperforms its accelerated counterparts.