Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks
This work provides incremental insights into the optimization dynamics of momentum gradient descent for machine learning practitioners, focusing on a specific architecture: 2-layer diagonal linear networks.
The paper tackles the problem of understanding how momentum affects the optimization trajectory of gradient descent, and finds that small values of $λ = \frac{γ}{(1 - β)^2}$, a quantity combining the step size $γ$ and the momentum parameter $β$, help recover sparse solutions in overparameterized regression settings.
In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $γ$ and momentum parameter $β$ that allows us to identify an intrinsic quantity $λ = \frac{γ}{(1 - β)^2}$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $λ$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
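To make the setting concrete, the following is a minimal sketch of momentum gradient descent on a 2-layer diagonal linear network, together with the quantity $λ$. Several details are assumptions made for illustration and are not specified in the abstract: the heavy-ball form of the update, the parametrisation of the regression vector as the elementwise product `u * v`, the halved mean squared error loss, the initialisation `u = alpha, v = 0`, and the synthetic sparse regression problem. The function names `intrinsic_lambda` and `momentum_gd_diag_net` are hypothetical.

```python
import numpy as np


def intrinsic_lambda(gamma: float, beta: float) -> float:
    """Quantity lambda = gamma / (1 - beta)^2 named in the abstract."""
    return gamma / (1.0 - beta) ** 2


def momentum_gd_diag_net(X, y, gamma, beta, alpha=0.1, n_steps=2000):
    """Heavy-ball momentum GD on a 2-layer diagonal linear network.

    Assumptions (not taken from the source): the network predicts
    x -> <u * v, x>, the loss is half the mean squared error, and the
    initialisation is u = alpha, v = 0.
    """
    n, d = X.shape
    theta = np.concatenate([np.full(d, alpha), np.zeros(d)])  # stacked (u, v)
    theta_prev = theta.copy()
    for _ in range(n_steps):
        u, v = theta[:d], theta[d:]
        g_w = X.T @ (X @ (u * v) - y) / n          # gradient w.r.t. w = u * v
        grad = np.concatenate([v * g_w, u * g_w])  # chain rule for (u, v)
        # Heavy-ball step: gradient step plus beta times the last displacement.
        theta_next = theta - gamma * grad + beta * (theta - theta_prev)
        theta_prev, theta = theta, theta_next
    return theta[:d] * theta[d:]


# Synthetic overparametrised regression problem (d > n) with a sparse target.
rng = np.random.default_rng(0)
n, d = 20, 40
w_star = np.zeros(d)
w_star[:3] = 1.0
X = rng.standard_normal((n, d))
y = X @ w_star


def loss(w):
    """Half the mean squared error of the linear predictor w."""
    return 0.5 * np.mean((X @ w - y) ** 2)


gamma, beta = 0.01, 0.5
w_hat = momentum_gd_diag_net(X, y, gamma, beta)
print("lambda:", intrinsic_lambda(gamma, beta))
print("initial loss:", loss(np.zeros(d)), "final loss:", loss(w_hat))
```

One way to explore the abstract's claims with this sketch would be to rerun it for several `(gamma, beta)` pairs and compare the recovered `w_hat`, for instance pairs that share the same $λ$, or pairs with decreasing $λ$. The sketch itself does not verify either claim.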