LG | Feb 12, 2022

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He

arXiv:2202.06009v3 | 11.822 citations | Has Code

Originality: Highly original

AI Analysis

This work addresses communication bottlenecks in distributed training of large models such as BERT and GPT, offering significant speed-ups for AI researchers and practitioners.

The paper tackles the slow convergence that arises when 1-bit gradient compression or local steps are applied to Adam-based large model pre-training. It proposes 0/1 Adam, which linearizes Adam steps so that both techniques can be used simultaneously. Compared to the 1-bit Adam baseline, this yields up to 87% less data volume, up to 54% fewer communication rounds, and up to 2× higher training throughput while maintaining accuracy.
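The non-linearity mentioned in this summary can be illustrated with a toy example. The sketch below is not taken from the paper: the function names, learning rate, and gradient values are illustrative assumptions. It uses only the first bias-corrected Adam step, where the update reduces to roughly `-lr * g / |g|`. For SGD, averaging two workers' updates equals the update on the averaged gradient. For the Adam-like step it does not, which is why gradients or local updates cannot simply be averaged.

```python
import math

def sgd_step(g, lr=0.1):
    # Plain SGD update: linear in the gradient.
    return [-lr * x for x in g]

def first_adam_step(g, lr=0.1, eps=1e-8):
    # First Adam step with bias correction: m_hat = g and v_hat = g^2,
    # so the update is -lr * g / (sqrt(g^2) + eps). Non-linear in g.
    return [-lr * x / (math.sqrt(x * x) + eps) for x in g]

def mean(a, b):
    # Element-wise average of two equally long vectors.
    return [(x + y) / 2 for x, y in zip(a, b)]

# Hypothetical gradients held by two workers.
g1 = [1.0, -4.0]
g2 = [3.0, 2.0]
g_avg = mean(g1, g2)

sgd_avg_of_steps = mean(sgd_step(g1), sgd_step(g2))
sgd_step_of_avg = sgd_step(g_avg)

adam_avg_of_steps = mean(first_adam_step(g1), first_adam_step(g2))
adam_step_of_avg = first_adam_step(g_avg)

print("SGD :", sgd_avg_of_steps, sgd_step_of_avg)
print("Adam:", adam_avg_of_steps, adam_step_of_avg)
```

In this example the two SGD results agree, while the two Adam-like results differ in the second coordinate, where the workers' gradients have opposite signs.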

1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Whether their benefits carry over to Adam-based large model pre-training (e.g. BERT and GPT), however, remains an open question. In this paper, we demonstrate that the non-linearity in Adam causes slow convergence even when 1-bit compression or local steps are applied individually. To alleviate this limitation, we propose 0/1 Adam, which linearizes each Adam step by approximating its optimizer states using their stale estimates and linear correlation. 0/1 Adam performs an Adam-like step to preserve adaptivity, while its linearity allows 1-bit compression and local steps to be used simultaneously for wall-clock speed-up. We provide a convergence guarantee for 0/1 Adam on smooth non-convex objectives. On various large-scale benchmarks such as BERT-Base, BERT-Large, and GPT-2 pre-training and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam reduces data volume by up to 87% and communication rounds by up to 54%, and achieves up to 2$\times$ higher training throughput and end-to-end training time reduction compared to the state-of-the-art baseline 1-bit Adam, while retaining the same statistical convergence speed and end-task model accuracy on the GLUE dataset and the ImageNet validation set.
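As a rough illustration of what 1-bit compression means in this setting, the sketch below shows a generic sign-based compressor with error feedback. It is not the 0/1 Adam algorithm or the authors' implementation. The function names and the choice of the mean absolute value as the scale are assumptions made for this example. Each value is sent as a single sign plus one shared scale, and the part lost to compression is carried over to the next round.

```python
def one_bit_compress(values, error):
    """Compress a vector to signs plus one scale, with error feedback.

    `error` is the residual left over from the previous round.
    Returns (signs, scale, new_error).
    """
    # Add back what earlier rounds failed to transmit.
    corrected = [v + e for v, e in zip(values, error)]
    # Assumed scale choice: mean absolute value of the corrected vector.
    scale = sum(abs(c) for c in corrected) / len(corrected)
    # One bit per coordinate: the sign.
    signs = [1 if c >= 0 else -1 for c in corrected]
    # Residual between what we wanted to send and what the receiver decodes.
    decoded = [scale * s for s in signs]
    new_error = [c - d for c, d in zip(corrected, decoded)]
    return signs, scale, new_error

def decompress(signs, scale):
    # Receiver side: rebuild an approximate vector from signs and scale.
    return [scale * s for s in signs]

# Hypothetical gradient and an empty error buffer for the first round.
grad = [0.5, -1.5, 1.0, -1.0]
error = [0.0] * len(grad)

signs, scale, error = one_bit_compress(grad, error)
approx = decompress(signs, scale)
print(signs, scale, approx, error)
```

With 32-bit floats, sending one sign bit per coordinate plus a single scale cuts the payload by roughly 32×. The residual in `error` is what lets later rounds compensate for the information dropped in this one.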

