Aggregated Momentum: Stability Through Passive Damping
This paper addresses a stability problem faced by practitioners who use momentum in optimization, and offers an incremental improvement over existing momentum methods.
The paper tackles the instability of momentum methods in gradient-based optimization when the damping coefficient is large. It proposes Aggregated Momentum (AggMo), which combines multiple velocity vectors with different damping coefficients to dampen oscillations. This keeps the method stable even for aggressive values such as $β$ = 0.999 and frequently delivers faster convergence.
Momentum is a simple and widely used trick which allows gradient-based optimizers to pick up speed along low curvature directions. Its performance depends crucially on a damping coefficient $β$. Large $β$ values can potentially deliver much larger speedups, but are prone to oscillations and instability; hence one typically resorts to small values such as 0.5 or 0.9. We propose Aggregated Momentum (AggMo), a variant of momentum which combines multiple velocity vectors with different $β$ parameters. AggMo is trivial to implement, but significantly dampens oscillations, enabling it to remain stable even for aggressive $β$ values such as 0.999. We reinterpret Nesterov's accelerated gradient descent as a special case of AggMo and analyze rates of convergence for quadratic objectives. Empirically, we find that AggMo is a suitable drop-in replacement for other momentum methods, and frequently delivers faster convergence.
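The abstract does not spell out the update rule, so the following is only a minimal sketch of what "combining multiple velocity vectors with different $β$ parameters" could look like. It assumes that each velocity decays by its own $β$ and accumulates the negative gradient, and that the parameters move along the average of the velocities. The names `aggmo_step`, `lr`, and `betas`, and the toy quadratic, are illustrative and not taken from the paper.

```python
def aggmo_step(theta, velocities, grad, lr, betas):
    """One AggMo-style update (sketch under the assumptions stated above).

    theta:      list of parameters
    velocities: one velocity vector (list) per damping coefficient
    grad:       gradient of the objective at theta
    lr:         learning rate (hypothetical name)
    betas:      damping coefficients, one per velocity
    """
    # Each velocity is damped by its own beta and pushed along -grad.
    new_velocities = [
        [beta * v_j - g_j for v_j, g_j in zip(v, grad)]
        for beta, v in zip(betas, velocities)
    ]
    # The parameters move along the average of all velocities.
    k = len(betas)
    new_theta = [
        t_j + (lr / k) * sum(v[j] for v in new_velocities)
        for j, t_j in enumerate(theta)
    ]
    return new_theta, new_velocities


def quadratic_grad(theta, curvatures):
    """Gradient of the toy objective 0.5 * sum(h_j * theta_j ** 2)."""
    return [h * t for h, t in zip(curvatures, theta)]


if __name__ == "__main__":
    # Illustrative settings only: one high- and one low-curvature direction.
    curvatures = [1.0, 0.01]
    betas = [0.0, 0.9, 0.99]
    theta = [1.0, 1.0]
    velocities = [[0.0, 0.0] for _ in betas]
    for _ in range(200):
        grad = quadratic_grad(theta, curvatures)
        theta, velocities = aggmo_step(theta, velocities, grad, lr=0.1, betas=betas)
    print(theta)
```

With a single coefficient (`betas=[beta]`) this sketch reduces to classical momentum, and with `betas=[0.0]` to plain gradient descent, which is a convenient sanity check when swapping it in for an existing optimizer.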