LG, Jan 16, 2025

Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization

arXiv:2501.09556v1, 2 citations, h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses the need for more efficient training in machine learning, though the advance is incremental because it builds on existing momentum methods.

The paper tackles slow convergence in momentum-based stochastic optimization by proposing Overshoot, a method that computes gradients at weights shifted in the direction of the current momentum. The authors report that this saves at least 15% of training steps on average compared with standard and Nesterov's momentum.

Overshoot is a novel, momentum-based stochastic gradient descent optimization method designed to enhance performance beyond standard and Nesterov's momentum. In conventional momentum methods, gradients from previous steps are aggregated with the gradient at the current model weights before taking a step and updating the model. Rather than calculating the gradient at the current model weights, Overshoot calculates the gradient at model weights shifted in the direction of the current momentum. This sacrifices the immediate benefit of using the gradient with respect to the exact current model weights in favor of evaluating it at a point that will likely be more relevant for future updates. We show that incorporating this principle into momentum-based optimizers (SGD with momentum and Adam) results in faster convergence (saving on average at least 15% of steps). Overshoot consistently outperforms both standard and Nesterov's momentum across a wide range of tasks and integrates into popular momentum-based optimizers with zero memory overhead and small computational overhead.
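The idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It is one simple reading of "gradient at weights shifted in the direction of the current momentum" applied to SGD with momentum. The function name `momentum_step_with_lookahead`, the `overshoot` factor, the toy quadratic objective, and all hyperparameter values are assumptions made for illustration. In this formulation `overshoot=0.0` gives classical momentum, and `overshoot=mu` corresponds to the usual look-ahead form of Nesterov's momentum; larger values evaluate the gradient further along the momentum direction.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def momentum_step_with_lookahead(
    theta: Vector,
    m: Vector,
    grad_fn: Callable[[Vector], Vector],
    lr: float = 0.05,
    mu: float = 0.9,
    overshoot: float = 0.0,  # hypothetical knob, not a name from the paper
) -> Tuple[Vector, Vector]:
    """One SGD-with-momentum step whose gradient is taken at shifted weights."""
    # Shift the weights along the direction the current momentum points in.
    lookahead = [t - lr * overshoot * mi for t, mi in zip(theta, m)]
    # Evaluate the gradient at the shifted point instead of at theta.
    g = grad_fn(lookahead)
    # Aggregate the new gradient into the momentum buffer.
    new_m = [mu * mi + gi for mi, gi in zip(m, g)]
    # Update the actual model weights with the aggregated momentum.
    new_theta = [t - lr * mi for t, mi in zip(theta, new_m)]
    return new_theta, new_m


def quadratic_grad(x: Vector) -> Vector:
    """Gradient of the toy objective f(x) = 0.5 * (x0**2 + 10 * x1**2)."""
    return [x[0], 10.0 * x[1]]


def run(overshoot: float, steps: int = 400) -> Vector:
    """Minimise the toy quadratic from a fixed start and return the weights."""
    theta, m = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta, m = momentum_step_with_lookahead(
            theta, m, quadratic_grad, lr=0.05, mu=0.9, overshoot=overshoot
        )
    return theta


if __name__ == "__main__":
    for factor in (0.0, 0.9, 3.0):
        print(factor, run(factor))
```

The sketch keeps only the weights and one momentum buffer. How the paper achieves its zero memory overhead, and how the principle is carried over to Adam, is not shown here.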

Code Implementations: 1 repo
