DC AI Jan 8, 2025

Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning

Lang Xu, Quentin Anthony, Jacob Hatef, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda

arXiv:2501.04266

Originality: Incremental advance

AI Analysis

This work addresses communication bottlenecks in distributed LLM training on specific high-performance hardware, representing an incremental improvement over existing methods like ZeRO++.

The paper tackles the problem of scaling large language model training on the Frontier supercomputer by proposing a 3-level hierarchical partitioning strategy to reduce communication overhead. For a 20B GPT model at up to 384 GCDs, it reports a 1.71x increase in TFLOPS per GPU compared with ZeRO++ and a scaling efficiency of 0.94.

Scaling up Large Language Model (LLM) training involves fitting a tremendous number of training parameters across a limited number of workers. However, methods like ZeRO-3 that drastically reduce GPU memory pressure often incur heavy communication to ensure global synchronization and consistency. Established efforts such as ZeRO++ use secondary partitions to avoid inter-node communication, given that intra-node GPU-GPU transfer generally has higher bandwidth and lower latency than inter-node connections. However, as more capable infrastructure such as Frontier, equipped with AMD GPUs, has emerged with impressive computing capability, there is a need to investigate the hardware topology and develop targeted strategies to improve training efficiency. In this work, we propose a collection of communication and optimization strategies for ZeRO++ to reduce communication costs and improve memory utilization. In particular, we propose a 3-level hierarchical partitioning designed specifically for Frontier, currently the 2nd-ranked supercomputing cluster, which leverages the different bandwidths across communication levels (GCD-GCD, GPU-GPU, and inter-node) to reduce communication overhead. For a 20B GPT model, we observe a 1.71x increase in TFLOPS per GPU compared with ZeRO++ and a scaling efficiency of 0.94, both at up to 384 GCDs.
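
The three communication levels named in the abstract can be illustrated with a small rank-grouping sketch. It is not the authors' implementation: it only shows how 384 GCDs would fall into GCD-pair, intra-node, and global groups, assuming a Frontier node holds four MI250X GPUs with two GCDs each and that ranks are numbered contiguously within a GPU and then within a node. The function name `hierarchy_groups` and the rank layout are hypothetical; the real mapping depends on the job launcher, and the abstract does not say which model state is partitioned at which level.

```python
from collections import defaultdict

# Assumed Frontier node layout: 4 MI250X GPUs per node, 2 GCDs per GPU.
GCDS_PER_GPU = 2
GPUS_PER_NODE = 4
GCDS_PER_NODE = GCDS_PER_GPU * GPUS_PER_NODE


def hierarchy_groups(world_size):
    """Return rank groups for the three levels (hypothetical helper).

    Assumes rank r lives on node r // 8 and, within that node,
    on GPU (r % 8) // 2. This contiguous layout is an assumption.
    """
    if world_size % GCDS_PER_NODE != 0:
        raise ValueError("world_size must be a multiple of GCDs per node")

    gcd_pairs = defaultdict(list)   # level 1: two GCDs on the same GPU
    node_groups = defaultdict(list) # level 2: all GCDs on the same node
    for rank in range(world_size):
        node = rank // GCDS_PER_NODE
        gpu = (rank % GCDS_PER_NODE) // GCDS_PER_GPU
        gcd_pairs[(node, gpu)].append(rank)
        node_groups[node].append(rank)

    return {
        "gcd_pair": list(gcd_pairs.values()),
        "intra_node": list(node_groups.values()),
        "global": [list(range(world_size))],  # level 3: spans all nodes
    }


groups = hierarchy_groups(384)
print(len(groups["gcd_pair"]), len(groups["intra_node"]))  # 192 48
```

In a real training job, groups like these would typically be turned into communicators (for example with `torch.distributed.new_group`), so that the most frequent collectives stay on the fastest links and only the remainder crosses nodes.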
