DC · AI · May 6

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

arXiv:2605.0447847.01 citations · h-index: 6
Predicted impact top 33% in DC · last 90 days · Originality: Incremental advance
AI Analysis

For practitioners of large-scale distributed training, CCL-D cuts the time to diagnose slow/hang anomalies from hours or days to minutes, addressing a major bottleneck in anomaly diagnosis.

CCL-D is a diagnostic system for slow/hang anomalies in large-scale model training, achieving near-complete coverage of known anomalies and pinpointing affected GPU ranks within 6 minutes on a 4,000-GPU cluster.

As training scales grow, collective communication libraries (CCLs) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root-cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause localization, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
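To make the probe/analyzer split concrete, here is a minimal sketch of how rank-level samples could be turned into a suspect list. It is an illustration only, not CCL-D's actual algorithm, which the abstract does not specify. The `RankSample` record, its fields, the two helper functions, and the 1-second lag threshold are all hypothetical. The assumed heuristic is that a hang suspect is a rank that has not yet entered the collective the other ranks are waiting in, and a slow suspect is a rank that entered it much later than the median.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RankSample:
    """Hypothetical per-rank probe record (not CCL-D's real schema)."""
    rank: int
    last_seq: int        # sequence number of the last collective this rank entered
    enter_time_s: float  # time at which the rank entered that collective


def suspect_hang_ranks(samples):
    """Return ranks that have not reached the newest collective seen on any rank."""
    newest = max(s.last_seq for s in samples)
    return sorted(s.rank for s in samples if s.last_seq < newest)


def suspect_slow_ranks(samples, lag_threshold_s=1.0):
    """Among ranks in the newest collective, flag late arrivals relative to the median."""
    newest = max(s.last_seq for s in samples)
    current = [s for s in samples if s.last_seq == newest]
    mid = median(s.enter_time_s for s in current)
    return sorted(s.rank for s in current if s.enter_time_s - mid > lag_threshold_s)


if __name__ == "__main__":
    samples = [
        RankSample(rank=0, last_seq=10, enter_time_s=100.0),
        RankSample(rank=1, last_seq=10, enter_time_s=100.1),
        RankSample(rank=2, last_seq=9, enter_time_s=95.0),    # never reached collective 10
        RankSample(rank=3, last_seq=10, enter_time_s=103.5),  # arrived late
    ]
    print("hang suspects:", suspect_hang_ranks(samples))
    print("slow suspects:", suspect_slow_ranks(samples))
```

A real system would need to separate a rank that causes the delay from ranks that merely wait on it, which is presumably where the cross-layer metrics mentioned in the abstract come in.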

