LG DC | Feb 23, 2024

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He

arXiv:2402.15627v139.3325 citations | h-index: 18 | Has Code | NSDI

Originality: Incremental advance

AI Analysis

This work addresses efficiency and stability problems in production-scale LLM training and is aimed at AI researchers and engineers. It is an incremental advance, since it builds on existing systems such as Megatron-LM.

The authors tackle training large language models (LLMs) on more than 10,000 GPUs, reaching 55.2% Model FLOPs Utilization (MFU) for a 175B-parameter model on 12,288 GPUs, a 1.34x improvement over Megatron-LM.

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
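To make the headline numbers concrete, the sketch below works through the arithmetic behind them. It is an illustration only, not the authors' measurement code. The function names, the `6 * N` FLOPs-per-token approximation (which ignores attention FLOPs), and the per-GPU peak throughput passed in the example are all assumptions introduced here; only the 55.2% MFU, the 1.34x ratio, the 175B model size, and the 12,288 GPU count come from the abstract.

```python
def implied_baseline_mfu(reported_mfu: float, speedup: float) -> float:
    """Baseline MFU implied by a reported MFU and a relative improvement.

    If MegaScale reaches `reported_mfu` and that is `speedup` times the
    baseline, the baseline is simply reported_mfu / speedup.
    """
    return reported_mfu / speedup


def estimate_mfu(tokens_per_sec: float, n_params: float,
                 n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Rough MFU estimate (hypothetical helper, not from the paper).

    ASSUMPTION: training costs about 6 * n_params FLOPs per token
    (forward + backward), ignoring attention FLOPs.
    """
    model_flops_per_sec = 6.0 * n_params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_flops_per_gpu
    return model_flops_per_sec / peak_flops_per_sec


if __name__ == "__main__":
    # Figures from the abstract: 55.2% MFU, 1.34x over Megatron-LM.
    baseline = implied_baseline_mfu(0.552, 1.34)
    print(f"Implied Megatron-LM MFU: {baseline:.1%}")

    # ASSUMPTION: 312e12 FLOP/s peak per GPU and 1e6 tokens/s are
    # placeholder inputs chosen for illustration, not reported values.
    mfu = estimate_mfu(tokens_per_sec=1e6, n_params=175e9,
                       n_gpus=12288, peak_flops_per_gpu=312e12)
    print(f"Example MFU under assumed inputs: {mfu:.1%}")
```

Dividing 55.2% by 1.34 suggests the Megatron-LM baseline ran at roughly 41% MFU in the same setting, which is the comparison the 1.34x figure summarizes.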
