DC · AI · LG · Sep 23, 2024

Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

arXiv:2409.15241v1 · 18 citations · h-index: 25
Originality: Incremental advance
AI Analysis

This work addresses communication inefficiency, a bottleneck for researchers and practitioners training LLMs at scale, and represents an incremental improvement over existing methods.

The paper tackles the communication overhead problem in distributed large language model (LLM) training by proposing Domino, a method that hides communication behind computation through tensor slicing and overlapping, achieving up to 1.3x speedup compared to Megatron-LM on Nvidia DGX-H100 GPUs.

Given the popularity of generative AI, Large Language Models (LLMs) often consume hundreds or thousands of GPUs to parallelize and accelerate the training process. Communication overhead becomes more pronounced when training LLMs at scale. To eliminate communication overhead in distributed LLM training, we propose Domino, which provides a generic scheme to hide communication behind computation. By breaking the data dependency of single-batch training into smaller independent pieces, Domino pipelines the training of these independent pieces and provides a generic strategy for fine-grained communication and computation overlapping. Extensive results show that, compared with Megatron-LM, Domino achieves up to 1.3x speedup for LLM training on Nvidia DGX-H100 GPUs.
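
The overlap idea in the abstract can be illustrated with a toy timing model. This is a minimal sketch, not Domino's implementation: the function names, the per-slice costs, and the assumption of exactly one compute stream and one communication stream running concurrently are all hypothetical choices made for illustration.

```python
def sequential_time(compute, comm):
    """Baseline: each slice's communication blocks the next computation."""
    return sum(compute) + sum(comm)


def overlapped_time(compute, comm):
    """Two-stage pipeline: slice i communicates while slice i+1 computes.

    Assumption (illustrative only): one compute stream and one
    communication stream, each processing slices in order.
    """
    compute_done = 0.0
    comm_done = 0.0
    for c, m in zip(compute, comm):
        compute_done += c
        # A slice's communication starts once its own compute has finished
        # and the previous slice's communication has drained.
        comm_done = max(comm_done, compute_done) + m
    return comm_done


def split_evenly(total, n_slices):
    """Split a total cost into n equal, independent pieces."""
    return [total / n_slices] * n_slices


if __name__ == "__main__":
    total_compute, total_comm = 8.0, 4.0  # made-up time units
    for n in (1, 2, 4, 8):
        c = split_evenly(total_compute, n)
        m = split_evenly(total_comm, n)
        print(n, sequential_time(c, m), overlapped_time(c, m))
```

In this model the unsliced batch takes 12 units either way, while four slices finish in 9 units because only the last slice's communication remains exposed. The model ignores per-slice overheads such as kernel launch cost and lower efficiency on smaller tensors, so it should not be read as a prediction of the speedups reported in the paper.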

Code Implementations: 1 repo