LG | Feb 12, 2022

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

arXiv:2202.06009v3 | 22 citations
Originality: Highly original
AI Analysis

This paper addresses communication bottlenecks in distributed training of large models such as BERT and GPT, offering up to 2× higher training throughput for AI researchers and practitioners.

The paper tackles the slow convergence that arises when 1-bit gradient compression or local steps are applied to Adam-based pre-training of large models. It proposes 0/1 Adam, which linearizes each Adam step so that both techniques can be used at the same time. Compared with the 1-bit Adam baseline, this yields up to 87% less data volume, up to 54% fewer communication rounds, and up to 2× higher training throughput while maintaining accuracy.
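To illustrate why linearity matters here, the following is a minimal sketch, not the paper's algorithm. It compares averaging per-worker update directions with computing one direction from the averaged gradient, for momentum SGD (linear) and for an Adam-style step (non-linear because of the division by the square root of the second moment). The helper names `momentum_direction` and `adam_direction`, the toy gradients, and the omission of bias correction are all assumptions made for illustration.

```python
import numpy as np

def momentum_direction(grad, m, beta=0.9):
    """Momentum update direction; linear in (grad, m)."""
    return beta * m + (1 - beta) * grad

def adam_direction(grad, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style direction (bias correction omitted for brevity).

    The division by sqrt(v_new) makes this non-linear in grad.
    """
    m_new = beta1 * m + (1 - beta1) * grad
    v_new = beta2 * v + (1 - beta2) * grad ** 2
    return m_new / (np.sqrt(v_new) + eps)

# Toy gradients held by two hypothetical workers.
g1 = np.array([1.0, -2.0, 0.5])
g2 = np.array([3.0, 1.0, -1.5])
m0 = np.zeros(3)
v0 = np.zeros(3)
g_avg = (g1 + g2) / 2

# Gap between "average of per-worker directions" and
# "direction computed from the averaged gradient".
mom_gap = np.abs(
    (momentum_direction(g1, m0) + momentum_direction(g2, m0)) / 2
    - momentum_direction(g_avg, m0)
).max()
adam_gap = np.abs(
    (adam_direction(g1, m0, v0) + adam_direction(g2, m0, v0)) / 2
    - adam_direction(g_avg, m0, v0)
).max()

print("momentum gap:", mom_gap)
print("adam gap:", adam_gap)
```

In this toy setting the momentum gap is zero up to floating-point rounding, while the Adam-style gap is large wherever the two workers' gradients disagree in sign. That mismatch is the kind of non-linearity that makes compressed or locally accumulated updates hard to combine under Adam.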

1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Their benefits, however, remain an open question on Adam-based large model pre-training (e.g. BERT and GPT). In this paper, we demonstrate that the non-linearity in Adam causes slow convergence even when 1-bit compression or local steps are individually applied. To alleviate this limitation, we propose 0/1 Adam, which linearizes each Adam step by approximating its optimizer states using their stale estimates and linear correlation. 0/1 Adam performs an Adam-like step to preserve adaptivity, while its linearity allows 1-bit compression and local steps to be used simultaneously for wall-clock speed-up. We provide a convergence guarantee for 0/1 Adam on smooth non-convex objectives. On various large-scale benchmarks such as BERT-Base, BERT-Large, GPT-2 pre-training and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam reduces data volume by up to 87% and communication rounds by up to 54%, and achieves up to 2× higher training throughput and end-to-end training time reduction compared to the state-of-the-art baseline 1-bit Adam, while enjoying the same statistical convergence speed and end-task model accuracy on the GLUE dataset and the ImageNet validation set.
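As a concrete picture of the 1-bit compression the abstract refers to, here is a minimal sketch of sign-plus-scale compression with error feedback, a generic scheme commonly used in 1-bit methods. It is not taken from the paper's code; the function names `one_bit_compress` and `decompress`, the mean-absolute-value scale, and the toy tensor are assumptions for illustration.

```python
import numpy as np

def one_bit_compress(tensor, error):
    """Compress a tensor to one sign per element plus a single scale.

    `error` is the residual left over from the previous round; adding it
    back before compressing is the error-feedback step.
    Returns (signs, scale, new_error).
    """
    corrected = tensor + error
    scale = np.abs(corrected).mean()            # one float for the whole tensor
    signs = np.where(corrected >= 0, 1.0, -1.0)  # one bit of information per element
    new_error = corrected - scale * signs        # what the compression lost
    return signs, scale, new_error

def decompress(signs, scale):
    """Reconstruct the approximate tensor on the receiving side."""
    return scale * signs

# Toy tensor standing in for a value a worker would communicate.
x = np.array([0.5, -1.5, 2.0, -0.1])
signs, scale, err = one_bit_compress(x, np.zeros_like(x))
x_hat = decompress(signs, scale)

print("signs:", signs)
print("scale:", scale)
print("residual carried to next round:", err)
```

Only the signs and the single scale would need to be sent, instead of one 32-bit float per element. The residual stays on the sender and is folded into the next round, so the information dropped by compression is delayed rather than lost.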

Code Implementations: 1 repo