DC · Apr 20

Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection

arXiv:2508.21613 · 9.0 · h-index: 5
Predicted impact: top 28% in DC · last 90 days · Originality: Incremental advance
AI Analysis

For large-scale distributed training, Chameleon reduces the performance degradation caused by faults without compromising model convergence or memory efficiency.

Chameleon adaptively selects a fault-tolerance strategy when a failure occurs during distributed training, staying within 11% of failure-free performance after recovery and reaching up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.

Training large language models is frequently interrupted by faults of various kinds, which makes robust fault tolerance essential. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur a performance penalty, whether from ongoing overhead, lengthy reconfiguration, or post-recovery inefficiency. We propose Chameleon, an adaptive fault-tolerant system that selects the best recovery strategy when a failure occurs. Chameleon achieves this through a unified performance model, fast execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Chameleon keeps the performance gap between post-recovery and failure-free training within 11.00%, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Chameleon achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.
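The idea of picking a recovery strategy from a performance estimate can be illustrated with a minimal sketch. This is not Chameleon's actual model or search procedure, which the abstract does not detail. It assumes each strategy can be summarised by two hypothetical estimates (a one-off reconfiguration pause and a post-recovery throughput) and that the goal is to maximise the samples processed over an assumed time horizon. The names `Candidate`, `samples_over_horizon`, `select_policy`, and all numbers are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One recovery option, described by hypothetical cost estimates."""
    name: str
    reconfig_seconds: float  # one-off pause while the new plan is applied
    throughput: float        # estimated samples/second after recovery


def samples_over_horizon(c: Candidate, horizon_seconds: float) -> float:
    """Samples processed within the horizon, counting the reconfiguration pause."""
    productive = max(0.0, horizon_seconds - c.reconfig_seconds)
    return productive * c.throughput


def select_policy(candidates, horizon_seconds: float) -> Candidate:
    """Return the candidate expected to process the most samples."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: samples_over_horizon(c, horizon_seconds))


# Illustrative numbers only; they are NOT measurements from the paper.
CANDIDATES = [
    # Ongoing overhead, but no pause at failure time.
    Candidate("redundant_computation", reconfig_seconds=0.0, throughput=80.0),
    # Lengthy reconfiguration, best throughput afterwards.
    Candidate("dynamic_parallelism", reconfig_seconds=120.0, throughput=100.0),
    # Quick to apply, somewhat inefficient after recovery.
    Candidate("data_rerouting", reconfig_seconds=5.0, throughput=85.0),
]

if __name__ == "__main__":
    for horizon in (60.0, 300.0, 3600.0):
        best = select_policy(CANDIDATES, horizon)
        print(f"horizon={horizon:.0f}s -> {best.name}")
```

With these made-up estimates, the preferred strategy changes as the horizon grows: a short horizon favours the option with no pause, while a long one amortises an expensive reconfiguration. That is the kind of trade-off an adaptive selector has to weigh.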

