DC · AI · LG · Nov 16, 2024

Improving training time and GPU utilization in geo-distributed language model training

arXiv:2411.14458v2 · 10 citations · h-index: 10 · Has Code
Originality: Highly original
AI Analysis

This paper addresses GPU scarcity and inefficiency in distributed training, a challenge for AI researchers and practitioners, and represents a strong incremental improvement over existing methods.

The paper tackles the problem of training large language models across multiple geo-distributed datacenters by introducing Atlas and BubbleTea, which together achieve up to 17x faster training and up to 94% GPU utilization.

The widespread adoption of language models (LMs) has caused a huge surge in demand for GPUs. Training large LMs requires tens of thousands of GPUs, and housing them in the same datacenter (DC) is a challenge due to many constraints, including the availability of peak power. We focus on training such models across multiple DCs connected via the wide-area network (WAN). We built Atlas, which reduces training time using novel workload-aware temporal bandwidth sharing and other design choices. While Atlas reduces training time, it does not completely eliminate the bubbles (idle GPU cycles). We built BubbleTea, which runs prefill-as-a-service (part of LM inference) during the bubbles, improving GPU utilization without any impact on training. Compared to state-of-the-art designs, Atlas and BubbleTea together achieve up to 17x faster training and up to 94% GPU utilization. The code will be open-sourced.

