Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation
This work addresses training-efficiency bottlenecks in neural machine translation, which makes it relevant to researchers and practitioners, though the contribution appears incremental.
The paper tackles the poor convergence of asynchronous stochastic gradient descent (SGD) that arises when the mini-batch size is increased for speed. It introduces local optimizers and momentum tuning to mitigate stale gradients, and it trains a shallow machine translation system 27% faster than an optimized baseline with a negligible BLEU penalty.
To extract the best possible performance from asynchronous stochastic gradient descent (SGD), one must increase the mini-batch size and scale the learning rate accordingly. To achieve further speedup, we introduce a technique that delays gradient updates, effectively increasing the mini-batch size. Unfortunately, increasing the mini-batch size worsens the stale gradient problem in asynchronous SGD, which leads to poor model convergence. We introduce local optimizers that mitigate the stale gradient problem and, together with fine-tuning our momentum, allow us to train a shallow machine translation system 27% faster than an optimized baseline with a negligible penalty in BLEU.
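The claim that delaying gradient updates effectively increases the mini-batch size can be illustrated with a minimal sketch. It is not the paper's implementation: it runs on a single worker (so there is no asynchrony and no stale gradient), it omits the local optimizers and momentum, and the toy model, data, and all function names (`make_data`, `chunk`, `grad_mse`, `sgd_delayed`) are assumptions made for illustration. It only shows that averaging the gradients of `delay` mini-batches before updating gives the same result as one update on a mini-batch `delay` times larger.

```python
import random


def make_data(n, seed=0):
    """Toy regression data with y = 3x (hypothetical, for illustration only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x) for x in xs]


def chunk(data, size):
    """Split data into consecutive mini-batches of `size` examples."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def grad_mse(w, batch):
    """Gradient of the mean squared error of the model y = w * x on one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd_delayed(w, batches, lr, delay):
    """Plain SGD that applies one update per `delay` mini-batches.

    All gradients in a group are computed at the same, not yet updated,
    weight and then averaged, so `delay` batches of size b behave like
    one batch of size delay * b.
    """
    acc, pending, updates = 0.0, 0, 0
    for batch in batches:
        acc += grad_mse(w, batch)
        pending += 1
        if pending == delay:
            w -= lr * acc / pending
            acc, pending = 0.0, 0
            updates += 1
    if pending:  # flush a trailing partial group, if any
        w -= lr * acc / pending
        updates += 1
    return w, updates


if __name__ == "__main__":
    data = make_data(32, seed=0)
    # Eight batches of 4 with a delay of 2 versus four batches of 8 with no delay.
    w_delayed, n_delayed = sgd_delayed(0.0, chunk(data, 4), lr=0.05, delay=2)
    w_large, n_large = sgd_delayed(0.0, chunk(data, 8), lr=0.05, delay=1)
    print(n_delayed, n_large, abs(w_delayed - w_large))
```

In this sketch the two runs perform the same number of updates and reach the same weight up to floating-point rounding. In an asynchronous setting the delay would also mean fewer, larger pushes from each worker, which is where the speedup and the aggravated staleness described above would both come from.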