LG | OC | ML | Mar 8, 2024

Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

arXiv:2403.05293v1 | 11 citations | h-index: 17 | AISTATS
Originality: Incremental advance
AI Analysis

This work provides incremental insights into optimization dynamics for machine learning practitioners, focusing on a single architecture: 2-layer diagonal linear networks.

The paper studies how momentum affects the optimization trajectory of gradient descent. It finds that small values of $λ = \frac{γ}{(1 - β)^2}$, where $γ$ is the step size and $β$ the momentum parameter, help recover sparse solutions when training a 2-layer diagonal linear network in overparameterized regression settings.

In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $γ$ and momentum parameter $β$ that allows us to identify an intrinsic quantity $λ= \frac{ γ}{ (1 - β)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $λ$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
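
As an illustration of the quantities named in the abstract, the following is a minimal NumPy sketch, not the authors' code. It assumes the heavy-ball form of momentum gradient descent, a diagonal linear network parametrised as $w = u \odot v$, a squared loss, and a simple initialisation ($u = α$, $v = 0$); none of these details are stated in the abstract. The helper names `intrinsic_lambda`, `make_sparse_regression`, `loss` and `momentum_gd_dln` are hypothetical.

```python
import numpy as np


def intrinsic_lambda(gamma, beta):
    """Quantity lambda = gamma / (1 - beta)^2 from the abstract."""
    return gamma / (1.0 - beta) ** 2


def make_sparse_regression(n=20, d=40, n_nonzero=3, seed=0):
    """Toy overparametrised problem (d > n) with a sparse ground truth.

    The sizes and the Gaussian design are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    w_star = np.zeros(d)
    w_star[:n_nonzero] = 1.0
    y = X @ w_star
    return X, y, w_star


def loss(X, y, w):
    """Squared loss 1/(2n) * ||X w - y||^2."""
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2)


def momentum_gd_dln(X, y, gamma, beta, alpha=0.1, n_steps=2000):
    """Heavy-ball momentum on a 2-layer diagonal linear network.

    Assumed parametrisation: w = u * v (elementwise product).
    Assumed initialisation: u = alpha, v = 0, so that w starts at 0.
    Returns the final regression vector w.
    """
    n, d = X.shape
    u = np.full(d, alpha)
    v = np.zeros(d)
    u_prev, v_prev = u.copy(), v.copy()
    for _ in range(n_steps):
        # Gradient of the loss with respect to w, then chain rule to (u, v).
        grad_w = X.T @ (X @ (u * v) - y) / n
        grad_u, grad_v = grad_w * v, grad_w * u
        # Heavy-ball step: gradient step plus beta times the previous displacement.
        u_next = u - gamma * grad_u + beta * (u - u_prev)
        v_next = v - gamma * grad_v + beta * (v - v_prev)
        u_prev, v_prev, u, v = u, v, u_next, v_next
    return u * v
```

A possible usage: build the data with `X, y, w_star = make_sparse_regression()`, then compare runs such as `momentum_gd_dln(X, y, gamma=0.01, beta=0.5)` and `momentum_gd_dln(X, y, gamma=0.0025, beta=0.75)`. Both pairs give `intrinsic_lambda` equal to 0.04, so the abstract's claim suggests they should follow similar paths when step sizes are small. Whether this toy setup reproduces that behaviour closely is not something the sketch guarantees.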

