LG · Jun 12, 2021

A decreasing scaling transition scheme from Adam to SGD

arXiv:2106.06749v2 · 16 citations · Has Code
AI Analysis

This paper addresses a practical problem for deep learning practitioners: it offers an incremental improvement to optimization algorithms that balances training speed against generalization.

The paper tackles the trade-off between Adam's fast early training and SGD's better generalization by proposing DSTAdam, a decreasing scaling transition scheme that smoothly switches from Adam to SGD, achieving improved performance on CIFAR-10/100 datasets.

The adaptive gradient algorithm (AdaGrad) and its variants, such as RMSProp, Adam, and AMSGrad, have been widely used in deep learning. Although these algorithms are faster in the early phase of training, their generalization performance is often not as good as that of stochastic gradient descent (SGD). Hence, a trade-off method that transforms Adam into SGD after a certain iteration, to gain the merits of both algorithms, is theoretically and practically significant. To that end, we propose a decreasing scaling transition scheme to achieve a smooth and stable transition from Adam to SGD, which is called DSTAdam. The convergence of the proposed DSTAdam is also proved in an online convex setting. Finally, the effectiveness of DSTAdam is verified on the CIFAR-10/100 datasets. Our implementation is available at: https://github.com/kunzeng/DSTAdam.
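The abstract does not spell out the transition rule, so the following is only a minimal sketch of the general idea of moving smoothly from Adam-like steps to SGD-like steps. The function name `blended_adam_sgd_step`, the exponential factor `rho = decay ** t`, and all hyperparameter values are assumptions made for illustration; this is not the DSTAdam update as defined in the paper or its repository.

```python
import math


def blended_adam_sgd_step(theta, grad, m, v, t, adam_lr=1e-3, sgd_lr=1e-1,
                          beta1=0.9, beta2=0.999, eps=1e-8, decay=0.99):
    """One update that blends Adam's per-coordinate step with a fixed SGD step.

    rho = decay ** t is a hypothetical decreasing scaling factor (an
    assumption, not the paper's schedule). rho near 1 gives Adam-like
    steps; rho near 0 gives SGD-like steps with learning rate sgd_lr.
    t is the 1-based iteration counter.
    """
    rho = decay ** t
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        # Standard Adam moment estimates with bias correction.
        mi = beta1 * mi + (1 - beta1) * g
        vi = beta2 * vi + (1 - beta2) * g * g
        m_hat = mi / (1 - beta1 ** t)
        v_hat = vi / (1 - beta2 ** t)
        # Adam's adaptive per-coordinate step size.
        adam_step = adam_lr / (math.sqrt(v_hat) + eps)
        # Convex blend: the Adam share shrinks as rho decreases.
        step = rho * adam_step + (1 - rho) * sgd_lr
        new_theta.append(th - step * m_hat)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v, rho


def minimize_quadratic(start, steps=500):
    """Minimize f(x) = sum(x_i ** 2) with the blended update (toy example)."""
    theta = list(start)
    m = [0.0] * len(theta)
    v = [0.0] * len(theta)
    rho = 1.0
    for t in range(1, steps + 1):
        grad = [2.0 * x for x in theta]  # gradient of sum(x_i ** 2)
        theta, m, v, rho = blended_adam_sgd_step(theta, grad, m, v, t)
    return theta, rho


final_theta, final_rho = minimize_quadratic([3.0, -2.0])
print(final_theta, final_rho)
```

Setting `decay=1.0` keeps `rho` at 1 and recovers a plain Adam step, while `decay=0.0` drops the adaptive term immediately and leaves an SGD step driven by the bias-corrected momentum. Values in between control how quickly the adaptive scaling fades.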

Code Implementations: 2 repos
