DC · AI · May 6

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

Yida Gu, Fakang Wang, Jianhao Fu, Zhenhang Sun, Qianyu Zhang, Hairui Zhao, Xingchen Liu, Yang Tian, Wenjing Huang, Zedong Liu, Yifan Chen, Jinwu Yang

arXiv:2605.04478 · 47.01 citations · h-index: 6

Predicted impact: top 33% in DC · last 90 days · Originality: Incremental advance

AI Analysis

For practitioners of large-scale distributed training, CCL-D cuts the time to diagnose slow/hang anomalies from hours or days to minutes, addressing a critical bottleneck in root-cause analysis.

CCL-D is a diagnostic system for slow/hang anomalies in large-scale model training, achieving near-complete coverage of known anomalies and pinpointing affected GPU ranks within 6 minutes on a 4,000-GPU cluster.

As training scales grow, collective communication libraries (CCLs) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow or hung communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root-cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause localization, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
