Yueyuan Zhou

2 Papers

10.0 · DC · May 6
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

Yida Gu, Fakang Wang, Jianhao Fu et al.

As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.

CV · Oct 11, 2019
CHD: Consecutive Horizontal Dropout for Human Gait Feature Extraction

Chengtao Cai, Yueyuan Zhou, Yanming Wang

Although research on gait recognition and person re-identification has made considerable progress, identification accuracy is still not high enough in some specific situations, for example, when people carry bags or change coats. To alleviate these problems, we propose a simple but effective Consecutive Horizontal Dropout (CHD) method, applied to human feature extraction in deep learning networks to avoid overfitting. With CHD, we strengthen the robustness of deep learning networks for cross-view gait recognition and person re-identification. The experiments show that rank-1 accuracy on the cross-view gait recognition task increased by about 10 percentage points, from 68.0% to 78.201%, and by about 8 percentage points, from 83.545% to 91.364%, on the person re-identification task under the coat- or jacket-wearing condition. In addition, 100% accuracy under the NM condition was obtained for the first time with CHD. On the CASIA-B benchmark, these accuracies are state of the art.