LG, AI · Jun 3, 2024

ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

arXiv:2406.02613v3
5 citations
Originality: Incremental advance
AI Analysis

The paper addresses scalability issues in distributed LLM training and is relevant to both researchers and practitioners. The advance is incremental because it builds on existing sharded optimization methods.

The paper tackles the communication overhead and memory inefficiency in distributed LLM training by proposing ACCO, a memory-efficient optimization algorithm that synchronizes delayed gradients while computing new ones, resulting in significantly faster training and effective scaling across heterogeneous hardware compared to ZeRO-1.

Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.
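The idle-time argument in the abstract can be illustrated with a toy cost model. This is only a sketch under simplifying assumptions, not the paper's algorithm or its measurements: the function names and the example timings are hypothetical, and the model assumes communication can be hidden perfectly behind computation.

```python
# Toy per-step cost model (hypothetical, not taken from the paper).
# compute_s: seconds a worker spends computing gradients for one step.
# comm_s:    seconds spent synchronizing gradients for one step.

def step_time_sequential(compute_s: float, comm_s: float) -> float:
    """Compute first, then communicate: the GPU waits during communication."""
    return compute_s + comm_s


def step_time_overlapped(compute_s: float, comm_s: float) -> float:
    """Communicate the previous (delayed) gradients while computing new ones.

    Assumes perfect overlap, so the slower of the two phases sets the pace.
    """
    return max(compute_s, comm_s)


def idle_fraction(compute_s: float, comm_s: float, overlapped: bool) -> float:
    """Fraction of each step during which the GPU is not computing."""
    if overlapped:
        total = step_time_overlapped(compute_s, comm_s)
    else:
        total = step_time_sequential(compute_s, comm_s)
    return (total - compute_s) / total


if __name__ == "__main__":
    compute_s, comm_s = 1.0, 0.5  # illustrative numbers only
    print(step_time_sequential(compute_s, comm_s))
    print(step_time_overlapped(compute_s, comm_s))
    print(idle_fraction(compute_s, comm_s, overlapped=False))
    print(idle_fraction(compute_s, comm_s, overlapped=True))
```

In this model, overlap removes idle time entirely whenever communication is no slower than computation. The price is that the synchronized gradients are one step stale, which is the convergence issue the abstract says the method mitigates.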

Code Implementations: 1 repo
