LG · Jun 12, 2021

Scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent

arXiv:2106.06753v1 · 1 citation · Has Code
Originality: Incremental advance
AI Analysis

This is an incremental improvement in optimization algorithms for deep learning practitioners: a hybrid approach intended to improve training efficiency and model performance.

The paper tackles the trade-off between training speed and accuracy in stochastic gradient descent by proposing a scaling transition method (TSGD) that shifts from momentum SGD to plain SGD, and it reports faster training, higher accuracy, and better stability in its experiments.

Plain stochastic gradient descent (SGD) and momentum SGD are very widely used in deep learning because of their simple settings and low computational complexity. Momentum SGD uses the accumulated gradient as the update direction for the current parameters, which gives it a faster training speed. The direction used by plain SGD, by contrast, is not corrected by the accumulated gradient: for the parameters currently being updated it is the optimal direction, so its update is more accurate. We combine the fast training speed of momentum SGD with the high accuracy of plain SGD and propose a scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent (TSGD) method. In addition, a learning rate that decreases linearly with the iterations replaces the constant learning rate. TSGD takes larger steps in the early stage to speed up training and smaller steps in the later stage so that it converges steadily. Our experimental results show that TSGD trains faster, reaches higher accuracy, and is more stable. Our implementation is available at: https://github.com/kunzeng/TSGD.
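
The idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it only assumes one plausible form of the transition, in which the update direction is a weighted blend of a momentum average and the raw gradient, and the weight on the momentum term decays exponentially over training. The function names, the `final_weight` parameter, the exponential form of the decay, and the default values are all assumptions made for illustration. Only the linearly decreasing learning rate is taken directly from the abstract.

```python
def linear_lr(t, total_steps, lr_start, lr_end):
    """Learning rate that decreases linearly from lr_start (t=1) to lr_end (t=total_steps)."""
    frac = (t - 1) / max(total_steps - 1, 1)
    return lr_start + (lr_end - lr_start) * frac


def momentum_weight(t, total_steps, final_weight):
    """Assumed scaling rule: weight on the momentum direction, decaying from
    about 1 at the start to final_weight at the last step."""
    return final_weight ** (t / total_steps)


def tsgd_sketch(grad_fn, x0, total_steps, lr_start=0.1, lr_end=0.001,
                beta=0.9, final_weight=0.01):
    """Illustrative momentum-to-plain-SGD transition on a list of floats."""
    x = list(x0)
    m = [0.0] * len(x)  # momentum buffer (exponential moving average of gradients)
    for t in range(1, total_steps + 1):
        g = grad_fn(x)
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        w = momentum_weight(t, total_steps, final_weight)
        # Early on w is near 1 (momentum direction); later w is near 0 (plain gradient).
        d = [w * mi + (1.0 - w) * gi for mi, gi in zip(m, g)]
        lr = linear_lr(t, total_steps, lr_start, lr_end)
        x = [xi - lr * di for xi, di in zip(x, d)]
    return x


# Toy problem: f(x) = 0.5 * (x0^2 + 10 * x1^2), whose gradient is (x0, 10 * x1).
def quadratic_grad(x):
    return [x[0], 10.0 * x[1]]


result = tsgd_sketch(quadratic_grad, [5.0, -3.0], total_steps=200)
print(result)
```

On this toy quadratic the iterate moves toward the minimizer at the origin. The sketch is only meant to show how the two schedules interact, not to reproduce the paper's experiments.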

Code Implementations: 5 repos