DC · Apr 20

Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection

Yuhang Zhou, Zhibin Wang, Peng Jiang, Haoran Xia, Junhe Lu, Qianyu Jiang, Rong Gu, Hengxi Xu, Xinjing Huang, Guanghuan Fang, Zhiheng Hu, Jingyi Zhang

arXiv:2508.21613 · 9.0 · h-index: 5

Predicted impact: top 28% in DC · last 90 days

Originality: Incremental advance

AI Analysis

For large-scale distributed training, Chameleon reduces performance degradation from faults without compromising convergence or memory.

Chameleon adaptively selects optimal fault-tolerance strategies during distributed training, achieving within 11% of failure-free performance and up to 1.355x higher throughput than existing methods.

Training large language models faces frequent interruptions due to various faults, demanding robust fault tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfigurations, or post-recovery inefficiencies. We propose Chameleon, an adaptive fault-tolerant system that intelligently selects optimal recovery strategies when a failure occurs. Chameleon achieves this through a unified performance model, expedient execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Chameleon keeps the performance gap between post-recovery and failure-free training within 11.00%, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Chameleon achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.
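The idea of choosing a recovery strategy from a performance model can be illustrated with a minimal sketch. This is not Chameleon's actual model or API: the `Policy` class, the `expected_samples` cost function, the planning horizon, and every number below are hypothetical, chosen only to mirror the trade-off the abstract describes (a cheap switch with lower throughput versus a long reconfiguration with higher throughput).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical description of one recovery strategy."""
    name: str
    reconfig_seconds: float  # assumed one-time stall while switching plans
    throughput: float        # assumed samples/second after recovery


def expected_samples(policy: Policy, horizon_seconds: float) -> float:
    """Samples processed within the horizon, counting the stall as lost time."""
    useful_seconds = max(0.0, horizon_seconds - policy.reconfig_seconds)
    return useful_seconds * policy.throughput


def select_policy(policies, horizon_seconds: float) -> Policy:
    """Pick the policy with the highest estimated work over the horizon."""
    return max(policies, key=lambda p: expected_samples(p, horizon_seconds))


# Illustrative numbers only; they are not measurements from the paper.
candidates = [
    Policy("redundant_computation", reconfig_seconds=5.0, throughput=80.0),
    Policy("dynamic_parallelism", reconfig_seconds=120.0, throughput=95.0),
    Policy("data_rerouting", reconfig_seconds=20.0, throughput=85.0),
]

for horizon in (60.0, 600.0, 7200.0):
    best = select_policy(candidates, horizon)
    print(f"horizon={horizon:>6.0f}s -> {best.name}")
```

Under these assumed numbers the preferred strategy changes with the horizon: a short horizon favors the cheapest switch, while a long one amortizes an expensive reconfiguration. That is the kind of trade-off an adaptive selector has to weigh at failure time.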
