Accelerate Distributed Stochastic Descent for Nonconvex Optimization with Momentum
This work addresses distributed training efficiency for nonconvex optimization, likely benefiting machine learning practitioners, but it appears incremental as it builds on existing momentum and model averaging methods.
The paper tackles the problem of accelerating distributed stochastic descent for nonconvex optimization. It introduces block momentum, a momentum method applied at the global learner level in model averaging approaches, and its experimental results show that block momentum accelerates training and achieves better results.
Momentum methods have been used extensively in optimizers for deep learning. Recent studies show that distributed training through K-step averaging has many desirable properties. We propose a momentum method for such model averaging approaches. At the individual learner level, traditional stochastic gradient descent is applied. At the meta level (the global learner level), a single momentum term is applied, which we call block momentum. We analyze the convergence and scaling properties of such momentum methods. Our experimental results show that block momentum not only accelerates training but also achieves better results.
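The two-level scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rule, so everything below is an assumption made for illustration: the toy scalar objective f(w) = w^2 + 3 sin^2(w), the heavy-ball form of the global update, the function names, and all hyperparameter values. Each simulated learner runs K plain SGD steps from the shared weights; the global learner averages the results and applies one momentum term to the averaged displacement.

```python
import math
import random


def noisy_grad(w, rng, noise=0.1):
    """Stochastic gradient of the toy nonconvex objective f(w) = w^2 + 3*sin(w)^2.

    The objective and the Gaussian noise model are illustrative assumptions.
    """
    return 2.0 * w + 3.0 * math.sin(2.0 * w) + rng.gauss(0.0, noise)


def local_sgd(w_start, k_steps, lr, rng):
    """One learner: K plain SGD steps (no momentum) from the shared global weights."""
    w = w_start
    for _ in range(k_steps):
        w -= lr * noisy_grad(w, rng)
    return w


def block_momentum_train(w0, n_learners=4, k_steps=8, rounds=50,
                         lr=0.05, mu=0.5, seed=0):
    """K-step averaging with a single momentum term at the global level.

    Assumed update (a heavy-ball form, not taken from the paper):
        velocity <- mu * velocity + (average of local weights - global weights)
        global   <- global + velocity
    """
    rng = random.Random(seed)
    w_global, velocity = w0, 0.0
    for _ in range(rounds):
        # Learners are simulated sequentially; each starts from the same global weights.
        local = [local_sgd(w_global, k_steps, lr, rng) for _ in range(n_learners)]
        w_avg = sum(local) / n_learners
        block_update = w_avg - w_global       # displacement produced by one K-step block
        velocity = mu * velocity + block_update
        w_global += velocity
    return w_global
```

With `mu=0.0` the global step reduces to plain K-step averaging, which makes it easy to compare the two variants on the same seed, for example `block_momentum_train(3.0, mu=0.0)` against `block_momentum_train(3.0, mu=0.5)`.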