LG | OC | Jul 5, 2024

LoCo: Low-Bit Communication Adaptor for Large-scale Model Training

arXiv:2407.04480v2 | 8 citations | h-index: 4
AI Analysis

This work addresses the communication-efficiency bottleneck in distributed training of large models, offering a practical improvement for AI researchers and engineers.

The paper addresses the degradation in training quality that low-bit gradient communication compression causes in large-scale model training. It proposes LoCo, a method that compensates gradients before compression, and reports training-speed improvements of 14% to 40% without performance loss on models such as LLAMAs and MoE.

To efficiently train large-scale models, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones for higher gradient synchronization efficiency among GPU nodes. However, it often degrades training quality due to compression information loss. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo designs a moving average of historical compensation errors to stably estimate the current compression error and then adopts it to compensate for the current gradient compression, yielding a less lossy compression. This mechanism allows it to be compatible with general optimizers like Adam and sharding strategies like FSDP. Theoretical analysis shows that integrating LoCo into full-precision optimizers like Adam and SGD does not impair their convergence speed on nonconvex problems. Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency, e.g., improving Adam's training speed by 14% to 40% without performance degradation on large language models like LLAMAs and MoE.
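
The compensate-then-compress idea described in the abstract can be illustrated with a minimal single-node sketch. This is not the authors' implementation: the uniform quantizer, the class name `ErrorCompensator`, the function name `quantize`, and the parameters `beta` and `bits` are all assumptions chosen for illustration, and the paper's actual update rule and quantization scheme may differ.

```python
import numpy as np


def quantize(x, bits=4):
    """Hypothetical low-bit compressor: symmetric uniform quantization.

    Rounds x onto 2**bits - 1 evenly spaced levels and returns the
    dequantized values, so the result can be compared against x directly.
    """
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4 bits
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x)
    scale = max_abs / levels
    return np.round(x / scale) * scale


class ErrorCompensator:
    """Illustrative error compensation with a moving-average error estimate."""

    def __init__(self, shape, beta=0.9, bits=4):
        self.err = np.zeros(shape)        # running estimate of compression error
        self.beta = beta                  # assumed moving-average coefficient
        self.bits = bits

    def compress(self, grad):
        # 1. Compensate the local gradient with the estimated error.
        compensated = grad + self.err
        # 2. Compress the compensated gradient; this is what would be sent.
        sent = quantize(compensated, self.bits)
        # 3. Measure what this compression step lost.
        new_err = compensated - sent
        # 4. Fold the new error into the moving average for the next step.
        self.err = self.beta * self.err + (1.0 - self.beta) * new_err
        return sent


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    comp = ErrorCompensator(shape=(8,))
    for step in range(3):
        g = rng.normal(size=8)
        sent = comp.compress(g)
        print(step, float(np.max(np.abs(g - sent))))
```

In a real distributed setup the `sent` tensor would be what each node contributes to gradient synchronization, while `err` stays local to the node. The moving average is what distinguishes this sketch from plain error feedback, which would carry over the full residual from the previous step.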

Code Implementations: 1 repo