LG · OC · Oct 11, 2021

Momentum Centering and Asynchronous Update for Adaptive Gradient Methods

Juntang Zhuang, Yifan Ding, Tommy Tang, Nicha Dvornek, Sekhar Tatikonda, James S. Duncan

arXiv:2110.05454v37.58 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses optimization in deep learning. It offers researchers and practitioners an optimizer intended to improve training efficiency and generalization. The advance is incremental because it builds on existing adaptive optimizers.

The paper aims to improve convergence and stability in adaptive gradient methods by proposing ACProp, which combines momentum centering with asynchronous updates. ACProp achieves a convergence rate of O(1/√T) in the stochastic non-convex case, which matches the oracle rate and improves on the O(log T/√T) rate of RMSProp and Adam. Empirically, the authors report that it outperforms SGD and other adaptive optimizers in image classification with CNNs, and outperforms well-tuned adaptive optimizers in GAN training, reinforcement learning, and transformer training.

We propose ACProp (Asynchronous-centering-Prop), an adaptive optimizer which combines centering of second momentum and asynchronous update (e.g. for the $t$-th update, the denominator uses information up to step $t-1$, while the numerator uses the gradient at the $t$-th step). ACProp has both strong theoretical properties and empirical performance. With the example by Reddi et al. (2018), we show that asynchronous optimizers (e.g. AdaShift, ACProp) have a weaker convergence condition than synchronous optimizers (e.g. Adam, RMSProp, AdaBelief); within asynchronous optimizers, we show that centering of second momentum further weakens the convergence condition. We demonstrate that ACProp has a convergence rate of $O(\frac{1}{\sqrt{T}})$ for the stochastic non-convex case, which matches the oracle rate and outperforms the $O(\frac{\log T}{\sqrt{T}})$ rate of RMSProp and Adam. We validate ACProp in extensive empirical studies: ACProp outperforms both SGD and other adaptive optimizers in image classification with CNNs, and outperforms well-tuned adaptive optimizers in the training of various GAN models, reinforcement learning and transformers. To sum up, ACProp has good theoretical properties, including a weak convergence condition and an optimal convergence rate, and strong empirical performance, including good generalization like SGD and training stability like Adam. We provide the implementation at https://github.com/juntang-zhuang/ACProp-Optimizer.
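The two ideas named in the abstract can be illustrated with a small scalar sketch. This is not the authors' implementation (see their repository for that). The exponential-moving-average form, the hyperparameter defaults, the initialization from a first gradient, and the names `init_state` and `async_centered_step` are all assumptions made for illustration. The sketch shows only that the denominator is computed from statistics gathered before the current step, and that the second-moment estimate tracks the squared deviation of the gradient from its running mean.

```python
import math


def init_state(first_grad):
    """Hypothetical initialization: seed both statistics from one gradient.

    Seeding s with first_grad**2 (an assumption) avoids a near-zero
    denominator on the very first update.
    """
    return {"m": first_grad, "s": first_grad ** 2}


def async_centered_step(theta, grad, state,
                        lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update for a scalar parameter."""
    # Asynchronous part: the denominator uses only information
    # accumulated up to step t-1 (state["s"] has not seen `grad` yet).
    denom = math.sqrt(state["s"] + eps)

    # The numerator is the gradient at step t.
    new_theta = theta - lr * grad / denom

    # Running mean of gradients, used as the center.
    m = beta1 * state["m"] + (1.0 - beta1) * grad

    # Centered second moment: moving average of the squared deviation
    # of the current gradient from the running mean.
    s = beta2 * state["s"] + (1.0 - beta2) * (grad - m) ** 2

    return new_theta, {"m": m, "s": s}


if __name__ == "__main__":
    # Toy objective f(theta) = theta**2, so the gradient is 2 * theta.
    theta = 1.0
    state = init_state(2.0 * theta)
    for _ in range(200):
        theta, state = async_centered_step(theta, 2.0 * theta, state, lr=0.01)
    print(theta)
```

Because the denominator is fixed before the current gradient is seen, the step size in this sketch scales linearly with the current gradient. In a synchronous update such as Adam's, the current gradient also enters the denominator.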
