LG DC ML
Jul 26, 2019

Taming Momentum in a Distributed Asynchronous Environment

Ido Hakimi, Saar Barkai, Moshe Gabel, Assaf Schuster

arXiv:1907.11612v37.126 citations

Originality: Highly original

AI Analysis

The paper addresses a bottleneck in efficiently scaling deep learning training, which matters to researchers and practitioners who train on large clusters.

The paper tackles gradient staleness in distributed asynchronous training with momentum, which hinders convergence. It proposes DANA, a technique that computes gradients at an estimated future position of the model's parameters. Evaluation on CIFAR and ImageNet shows DANA outperforming existing methods in final accuracy and convergence speed while scaling to a total batch size of 16K on 64 asynchronous workers.

Although distributed computing can significantly reduce the training time of deep neural networks, scaling the training process while maintaining high efficiency and final accuracy is challenging. Distributed asynchronous training enjoys near-linear speedup, but asynchrony causes gradient staleness, the main difficulty in scaling stochastic gradient descent to large clusters. Momentum, which is often used to accelerate convergence and escape local minima, exacerbates gradient staleness, thereby hindering convergence. We propose DANA: a novel technique for asynchronous distributed SGD with momentum that mitigates gradient staleness by computing the gradient on an estimated future position of the model's parameters. Thereby, we show for the first time that momentum can be fully incorporated in asynchronous training with almost no ramifications to final accuracy. Our evaluation on the CIFAR and ImageNet datasets shows that DANA outperforms existing methods in both final accuracy and convergence speed while scaling up to a total batch size of 16K on 64 asynchronous workers.
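The look-ahead idea in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' DANA algorithm: it assumes a single scalar parameter, the objective f(x) = 0.5 * x^2, one shared momentum buffer, and round-robin workers whose gradients arrive after `num_workers - 1` other updates. The names `look_ahead` and `simulate` are hypothetical. With `use_look_ahead=True`, each worker computes its gradient at an estimate of where the parameter will be when that gradient is applied, extrapolating the current momentum and ignoring the gradients still to arrive.

```python
from collections import deque


def grad(x):
    """Gradient of the toy objective f(x) = 0.5 * x**2 (an assumption)."""
    return x


def look_ahead(theta, v, lr, gamma, tau):
    """Estimate theta after `tau` more momentum steps, assuming no new gradients."""
    # Sum of gamma**k for k = 1..tau: how much of the current momentum
    # would still be applied over the next tau updates.
    decay = sum(gamma ** k for k in range(1, tau + 1))
    return theta - lr * decay * v


def simulate(num_workers, steps, lr=0.1, gamma=0.9, use_look_ahead=False):
    """Round-robin asynchronous SGD with momentum on the toy objective."""
    theta, v = 1.0, 0.0
    tau = num_workers - 1  # updates applied between a worker's pull and its push
    # Points at which in-flight gradients are being computed, oldest first.
    pending = deque([theta] * num_workers)
    for _ in range(steps):
        stale_point = pending.popleft()
        g = grad(stale_point)      # gradient computed at a stale position
        v = gamma * v + g          # momentum update on the master
        theta = theta - lr * v     # parameter update on the master
        # The worker pulls either the current parameters or a look-ahead estimate.
        if use_look_ahead:
            pending.append(look_ahead(theta, v, lr, gamma, tau))
        else:
            pending.append(theta)
    return theta
```

Calling `simulate(8, 200)` and `simulate(8, 200, use_look_ahead=True)` compares the two variants on this toy problem. With one worker there is no staleness, so both variants coincide. Any difference seen here is a property of the toy setup and says nothing about the results reported in the paper.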
