LG, AI, DC, PF | May 22, 2023

A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

arXiv:2305.13525v3 | 10 citations
Originality: Incremental advance
AI Analysis

This work addresses a critical performance problem for practitioners who scale parallel training to large GPU systems. The advance is incremental, since it builds on existing parallelism methods.

The paper tackles the communication bottleneck in scaling billion-parameter neural network training to thousands of GPUs by introducing a 4D hybrid algorithm, achieving a 26% speedup over Megatron-LM and 57% of theoretical peak FLOP/s on an 80-billion parameter GPT model with 1024 GPUs.
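To put the 57% figure in absolute terms, the arithmetic can be sketched as below. The per-GPU peak is an assumption not stated in this summary: the sketch takes each GPU to be an NVIDIA A100 with a dense bf16 peak of 312 TFLOP/s. The function name `sustained_pflops` is hypothetical and chosen only for illustration.

```python
# Assumption (not stated in the summary): each GPU is an NVIDIA A100
# with a dense bf16 peak of 312 TFLOP/s.
ASSUMED_PEAK_TFLOPS_PER_GPU = 312.0


def sustained_pflops(num_gpus, fraction_of_peak,
                     peak_tflops_per_gpu=ASSUMED_PEAK_TFLOPS_PER_GPU):
    """Aggregate sustained rate in PFLOP/s for a given fraction of peak."""
    aggregate_peak_tflops = num_gpus * peak_tflops_per_gpu
    return aggregate_peak_tflops * fraction_of_peak / 1000.0


if __name__ == "__main__":
    # 1024 GPUs running at 57% of the assumed peak.
    print(f"{sustained_pflops(1024, 0.57):.1f} PFLOP/s")
```

Under that assumption, the result is about 182 PFLOP/s for 1024 GPUs.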

Heavy communication, in particular collective operations, can become a critical performance bottleneck when scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training. The 4D approach is a hybrid of 3D tensor parallelism and data parallelism, and is implemented in the AxoNN framework. In addition, we employ two key strategies to further minimize communication overheads. First, we aggressively overlap expensive collective operations (reduce-scatter, all-gather, and all-reduce) with computation. Second, we develop an analytical model to identify high-performing configurations within the large search space defined by our 4D algorithm. This model simplifies tuning for practitioners' specific training workloads. When training an 80-billion-parameter GPT on 1024 GPUs of Perlmutter, AxoNN outperforms Megatron-LM, a state-of-the-art framework, by 26%. It also achieves 57% of the theoretical peak FLOP/s, or 182 PFLOP/s in total.
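To give a sense of why the configuration search space is large, here is a minimal sketch that enumerates every way to split a GPU count into three tensor-parallel dimensions and one data-parallel dimension. The names `gx`, `gy`, `gz`, and `gdata` are hypothetical labels for the four dimensions, not AxoNN's own identifiers. The sketch only lists candidates; it does not reproduce the paper's analytical model, which would be needed to rank them.

```python
def divisors(n):
    """Return all positive divisors of n in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_4d_configs(num_gpus):
    """List every ordered (gx, gy, gz, gdata) whose product is num_gpus.

    gx, gy, gz stand for the three tensor-parallel dimensions and gdata
    for the data-parallel dimension (hypothetical names for illustration).
    """
    configs = []
    for gx in divisors(num_gpus):
        for gy in divisors(num_gpus // gx):
            for gz in divisors(num_gpus // (gx * gy)):
                # The last dimension is fixed by the other three.
                gdata = num_gpus // (gx * gy * gz)
                configs.append((gx, gy, gz, gdata))
    return configs


if __name__ == "__main__":
    configs = enumerate_4d_configs(1024)
    print(len(configs))  # 286 candidate decompositions for 1024 GPUs
```

Even before accounting for batch size or other tuning knobs, 1024 GPUs admit 286 ordered decompositions, which illustrates why a model that narrows the candidates is useful in place of exhaustive benchmarking.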

Code Implementations: 1 repo
