NI, AI, LG | Jul 22, 2023

Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters

arXiv:2307.12169v5 | 62 citations | h-index: 22
Originality: Incremental advance
AI Analysis

The paper addresses the cost and inefficiency of network infrastructure for hyperscale LLM training and offers a lower-cost design for AI researchers and companies. The advance is incremental: it optimizes existing network designs rather than introducing a new paradigm.

The paper tackles the high cost and power consumption of datacenter networks for training large language models (LLMs) by proposing a Rail-only network architecture that eliminates the spine layer, reducing network cost by 38-77% and power consumption by 37-75% while maintaining training performance.

This paper presents a low-cost network architecture for training large language models (LLMs) at hyperscale. We study the optimal parallelization strategy of LLMs and propose a novel datacenter network design tailored to LLMs' unique communication patterns. We show that LLM training generates sparse communication patterns in the network and, therefore, does not require an any-to-any full-bisection network to complete efficiently. As a result, our design eliminates the spine layer in traditional GPU clusters. We name this design a Rail-only network and demonstrate that it achieves the same training performance while reducing the network cost by 38% to 77% and network power consumption by 37% to 75% compared to a conventional GPU datacenter. Our architecture also supports Mixture-of-Experts (MoE) models with all-to-all communication through forwarding, with only 8.2% to 11.2% completion time overhead for all-to-all traffic. We study the failure robustness of Rail-only networks and provide insights into the performance impact of different network and training parameters.
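
The forwarding idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simplified model that the abstract does not spell out: each GPU belongs to one high-bandwidth (HB) domain, such as a server, and GPUs with the same local rank across domains share one rail switch. With no spine layer, traffic between GPUs on different rails in different domains is first forwarded inside the source domain and then crosses the destination's rail. The names `Gpu`, `route`, `hb<N>`, and `rail<N>` are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Gpu(NamedTuple):
    domain: int  # index of the high-bandwidth (HB) domain, e.g. one server (assumed model)
    rank: int    # local rank inside the domain; also used as the rail index (assumed model)


def route(src: Gpu, dst: Gpu) -> List[str]:
    """Return the sequence of fabrics a message crosses in a rail-only layout."""
    if src == dst:
        return []  # nothing to send over any fabric
    if src.domain == dst.domain:
        # Same HB domain: stay on the intra-domain interconnect.
        return [f"hb{src.domain}"]
    if src.rank == dst.rank:
        # Same rail, different domains: one hop through the shared rail switch.
        return [f"rail{src.rank}"]
    # Different rail and different domain: there is no spine to cross rails,
    # so forward inside the source domain to the GPU that sits on the
    # destination's rail, then cross that rail.
    return [f"hb{src.domain}", f"rail{dst.rank}"]


if __name__ == "__main__":
    print(route(Gpu(0, 1), Gpu(3, 1)))  # same rail
    print(route(Gpu(0, 1), Gpu(3, 2)))  # forwarded via the HB domain
```

In this sketch only the last case takes the extra forwarding hop. That is the kind of traffic (for example, MoE all-to-all) for which the abstract reports an 8.2% to 11.2% completion time overhead.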
