LG · OC · ML · Feb 9, 2020

Momentum Improves Normalized SGD

arXiv:2002.03305v2 · 186 citations
AI Analysis

The paper gives improved convergence guarantees for non-convex optimization in machine learning, though the contribution appears incremental, since it builds on existing normalized SGD methods.

The paper addresses the large batch sizes that normalized SGD requires on non-convex objectives, showing that adding momentum provably removes this requirement. For objectives with bounded second derivative, a small tweak to the momentum formula finds an ε-critical point in O(1/ε^{3.5}) iterations, matching the best-known rates without logarithmic factors or dependence on dimension. The paper also presents an adaptive variant that improves convergence automatically when gradient variance is small, and reports that the method matches state-of-the-art results on ResNet-50 and BERT pretraining.

We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider the case of objectives with bounded second derivative and show that in this case a small tweak to the momentum formula allows normalized SGD with momentum to find an $ε$-critical point in $O(1/ε^{3.5})$ iterations, matching the best-known rates without accruing any logarithmic factors or dependence on dimension. We also provide an adaptive method that automatically improves convergence rates when the variance in the gradients is small. Finally, we show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining, matching the performance of the disparate methods used to get state-of-the-art results on both tasks.
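
The basic update described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common form in which a momentum buffer is an exponential average of stochastic gradients and each step moves a fixed distance along the normalized buffer. The names `normalized_sgd_momentum`, `grad_fn`, `lr`, `beta`, and `eps`, the toy objective `noisy_quadratic_grad`, and all hyperparameter values are hypothetical. The "small tweak" for the bounded second derivative case and the adaptive variant are not shown.

```python
import numpy as np


def normalized_sgd_momentum(grad_fn, x0, lr=0.01, beta=0.9, steps=1000,
                            eps=1e-12, seed=0):
    """Sketch of normalized SGD with momentum (assumed form, not the paper's code).

    grad_fn(x, rng) returns a stochastic gradient estimate at x.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)  # momentum buffer
    for _ in range(steps):
        g = grad_fn(x, rng)
        # Exponential moving average of stochastic gradients.
        m = beta * m + (1.0 - beta) * g
        # Step of length (approximately) lr along the normalized momentum;
        # eps guards against division by zero.
        x = x - lr * m / (np.linalg.norm(m) + eps)
    return x


def noisy_quadratic_grad(x, rng, noise=0.1):
    """Toy stochastic gradient of f(x) = 0.5 * ||x||^2 (illustrative only)."""
    return x + noise * rng.standard_normal(x.shape)


x_start = np.array([3.0, -4.0])
x_final = normalized_sgd_momentum(noisy_quadratic_grad, x_start)
print(np.linalg.norm(x_start), np.linalg.norm(x_final))
```

Because the step is normalized, every iteration moves the same distance `lr` regardless of gradient scale, so the momentum average, rather than a large batch, is what damps the noise in the step direction.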

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
