DC · LG · ML · Aug 26, 2020

APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm

arXiv:2008.11343v2 · 8 citations
AI Analysis

This paper addresses the communication bottleneck in parallelizing Adam for training tasks such as BERT and ImageNet. It offers a significant speed-up, but the contribution is incremental because it builds on existing Adam and compression methods.

The paper tackles Adam's incompatibility with gradient compression, which makes communication the bottleneck in parallel training. It proposes APMSqueeze, a communication-efficient algorithm that converges in a similar number of epochs to Adam while reducing per-epoch running time, giving an end-to-end speed-up of up to 2-10x depending on network bandwidth.

Adam is an important optimization algorithm for guaranteeing efficiency and accuracy when training many important tasks such as BERT and ImageNet. However, Adam is generally not compatible with information (gradient) compression technology, so communication usually becomes the bottleneck when parallelizing Adam. In this paper, we propose a communication-efficient Adam-Preconditioned Momentum SGD algorithm, named APMSqueeze, which compresses gradients through an error-compensated method. The proposed algorithm achieves convergence efficiency similar to Adam in terms of epochs, but significantly reduces the running time per epoch. In terms of end-to-end performance (including the full-precision preconditioning step), APMSqueeze provides up to a 2-10x speed-up depending on network bandwidth. We also conduct a theoretical analysis of its convergence and efficiency.
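The abstract names two ingredients: error-compensated compression and a momentum SGD step scaled by an Adam-style preconditioner. The sketch below illustrates how those pieces can fit together on a single worker. It is not the paper's algorithm: the sign-plus-scale compressor, the choice to compress the momentum vector, the fixed `precond` vector (assumed to come from an earlier full-precision phase), and all function and class names are assumptions made for illustration.

```python
import math


def sign_compress(vec):
    """Hypothetical 1-bit-style compressor: keep signs plus one shared magnitude."""
    scale = sum(abs(x) for x in vec) / len(vec)
    return [scale if x >= 0 else -scale for x in vec]


class ErrorFeedback:
    """Error compensation: the part lost to compression is added back next round."""

    def __init__(self, dim):
        self.residual = [0.0] * dim

    def __call__(self, vec):
        corrected = [x + r for x, r in zip(vec, self.residual)]
        sent = sign_compress(corrected)
        # Whatever was not transmitted is remembered for the next call.
        self.residual = [c - s for c, s in zip(corrected, sent)]
        return sent


def preconditioned_momentum_step(params, grad, momentum, precond, compressor,
                                 lr=0.1, beta=0.9, eps=1e-8):
    """One momentum SGD step scaled by a fixed second-moment estimate `precond`.

    Assumption: `precond` was estimated beforehand in full precision and is held
    constant here, so only the compressed momentum would need to be communicated.
    """
    new_momentum = [beta * m + (1 - beta) * g for m, g in zip(momentum, grad)]
    sent = compressor(new_momentum)  # stands in for the compressed communication
    new_params = [p - lr * s / (math.sqrt(v) + eps)
                  for p, s, v in zip(params, sent, precond)]
    return new_params, new_momentum


# Usage: one step on a 2-parameter toy problem.
compressor = ErrorFeedback(dim=2)
params, momentum = preconditioned_momentum_step(
    params=[1.0, -1.0], grad=[2.0, -4.0], momentum=[0.0, 0.0],
    precond=[4.0, 16.0], compressor=compressor)
```

In a multi-worker setting the compressed vectors would be averaged across workers before the update. That step is omitted here to keep the sketch short.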

