LG · May 20, 2022

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora

Tsinghua

arXiv:2205.10287v3 · 30.4110 citations · h-index: 23 · Has Code

Originality: Incremental advance

AI Analysis

This work gives machine learning researchers and practitioners tools for analyzing adaptive gradient methods and for tuning them in large-scale vision and language settings. It is an incremental advance in that it builds on existing SDE frameworks for SGD.

The paper derives SDE approximations for the adaptive gradient algorithms RMSprop and Adam and proves their correctness, which enables both theoretical analysis and practical application. It also introduces a square root scaling rule for adjusting hyperparameters when the batch size changes, and validates the rule empirically on deep learning tasks.

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
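The abstract names the square root scaling rule without spelling it out. The sketch below shows one way such a rule could be applied to Adam. It assumes the rule takes the following form when the batch size is multiplied by a factor $\kappa$: the learning rate is multiplied by $\sqrt{\kappa}$, $1-\beta_1$ and $1-\beta_2$ are multiplied by $\kappa$, and $\epsilon$ is divided by $\sqrt{\kappa}$. That form, the `AdamHParams` container, and the `sqrt_scale` helper are illustrative assumptions, not taken from this page or from the authors' code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AdamHParams:
    """Hypothetical container for the Adam hyperparameters being rescaled."""
    lr: float
    beta1: float
    beta2: float
    eps: float


def sqrt_scale(h: AdamHParams, kappa: float) -> AdamHParams:
    """Rescale Adam hyperparameters when the batch size is multiplied by kappa.

    Assumed form of the rule (an assumption of this sketch):
      lr        -> sqrt(kappa) * lr
      1 - beta1 -> kappa * (1 - beta1)
      1 - beta2 -> kappa * (1 - beta2)
      eps       -> eps / sqrt(kappa)
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    beta1 = 1.0 - kappa * (1.0 - h.beta1)
    beta2 = 1.0 - kappa * (1.0 - h.beta2)
    # For large kappa the rescaled betas can leave [0, 1); refuse rather than clip.
    if not (0.0 <= beta1 < 1.0 and 0.0 <= beta2 < 1.0):
        raise ValueError("kappa is too large for these beta values")
    root = math.sqrt(kappa)
    return AdamHParams(lr=h.lr * root, beta1=beta1, beta2=beta2, eps=h.eps / root)


# Example: go from batch size 256 to 1024, so kappa = 4.
base = AdamHParams(lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8)
scaled = sqrt_scale(base, kappa=1024 / 256)
print(scaled)
```

Under this assumed form, the learning rate grows only with the square root of the batch-size ratio, in contrast to the linear scaling commonly used for SGD, and the momentum coefficients have to move together with it. The helper raises an error instead of clipping when a large $\kappa$ would push a rescaled $\beta$ below zero.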
