LG, DC | Dec 17, 2024

Echo: Simulating Distributed Training At Scale

Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, Hong Xu

arXiv:2412.12487v16.410 citations | h-index: 6

Originality: Incremental advance

AI Analysis

This work provides a more accurate and efficient simulation tool for managing massive ML clusters and distributed training jobs, though its contribution is an incremental improvement over existing simulation methods.

The paper tackles the challenge of simulating large-scale distributed training by addressing key issues in workload tracing, communication estimation, and computation slowdown, achieving an average 8% error in training step time, which is about 3x lower than state-of-the-art simulators.

Simulation offers unique value for both enumeration and extrapolation purposes, and is becoming increasingly important for managing massive machine learning (ML) clusters and large-scale distributed training jobs. In this paper, we build Echo to tackle three key challenges in large-scale training simulation: (1) tracing the runtime training workloads at each device in an ex-situ fashion, so that a single device can be used to obtain the actual execution graphs of 1K-GPU training; (2) accurately estimating collective communication without the high overhead of discrete-event-based network simulation; and (3) accounting for the interference-induced computation slowdown from overlapping communication and computation kernels on the same device. Echo delivers on average 8% error in training step time, roughly 3x lower than state-of-the-art simulators, for GPT-175B on a 96-GPU H800 cluster with 3D parallelism on Megatron-LM, in under 2 minutes.
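To make challenges (2) and (3) concrete, the following is a minimal sketch of what an analytical (non-discrete-event) estimate might look like. It is not Echo's model: it assumes a textbook alpha-beta cost for a ring all-reduce and a single, hypothetical `slowdown` factor applied to compute that overlaps with communication. All function names, parameters, and numbers here are illustrative assumptions rather than anything taken from the paper.

```python
def ring_allreduce_time(message_bytes, num_gpus, bandwidth_bytes_per_s, latency_s):
    """Alpha-beta estimate for a ring all-reduce (assumed model, not Echo's).

    A ring all-reduce takes 2 * (p - 1) steps, each sending message_bytes / p.
    """
    if num_gpus < 2:
        return 0.0
    steps = 2 * (num_gpus - 1)
    chunk_bytes = message_bytes / num_gpus
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


def step_time(compute_s, comm_s, overlap_fraction, slowdown):
    """Step time when part of the communication overlaps with compute.

    overlap_fraction: share of comm_s hidden behind compute (0.0 to 1.0).
    slowdown: hypothetical interference factor (>= 1.0) applied only to the
              compute that runs concurrently with communication.
    """
    overlapped_comm = comm_s * overlap_fraction
    exposed_comm = comm_s - overlapped_comm
    # Compute that runs while communication is in flight is slowed down.
    overlapped_compute = min(compute_s, overlapped_comm)
    clean_compute = compute_s - overlapped_compute
    return clean_compute + overlapped_compute * slowdown + exposed_comm


def relative_error(simulated_s, measured_s):
    """Relative step-time error, the kind of metric behind an '8% error' figure."""
    return abs(simulated_s - measured_s) / measured_s


if __name__ == "__main__":
    # Illustrative inputs only: 1 GB gradient, 8 GPUs, 100 GB/s links, 5 us latency.
    comm = ring_allreduce_time(1e9, 8, 100e9, 5e-6)
    total = step_time(compute_s=1.0, comm_s=comm, overlap_fraction=0.5, slowdown=1.2)
    print(f"all-reduce: {comm:.5f} s, step: {total:.5f} s")
```

A closed-form estimate like this is cheap to evaluate, which is the general appeal of avoiding packet-level simulation; the trade-off is that it ignores topology, congestion, and kernel-level effects unless those are folded into the parameters.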
