TrainMover: An Interruption-Resilient and Reliable ML Training Runtime
This work addresses reliability issues faced by users of large-scale ML training, though the contribution appears incremental because it builds on existing checkpointing and reconfiguration methods.
The paper tackles frequent interruptions in large-scale ML training jobs by introducing TrainMover, a resilient runtime that uses standby machines to achieve second-level downtime during migration and to maintain 99% training efficiency under periodic 10-minute rebalancing.
Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpointing or runtime reconfiguration suffer from long downtimes, degraded performance, or undesired changes to training strategies. We present TrainMover, a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces two key techniques: two-phase, delta-based communication group setups and communication-free sandboxed shadow iterations. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99% training efficiency during periodic 10-minute rebalancing. We also demonstrate the effectiveness of TrainMover in handling various interruptions.
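To relate the two headline numbers, the short sketch below works through the arithmetic under a simplifying assumption that is ours, not the paper's: training efficiency is taken to be the fraction of each rebalancing interval spent on productive training, with migration downtime as the only loss. The function names `training_efficiency` and `max_downtime` are hypothetical helpers introduced for illustration.

```python
def training_efficiency(interval_s: float, downtime_s: float) -> float:
    """Fraction of one rebalancing interval spent training.

    Assumption (not from the paper): downtime is the only source of loss.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if not 0 <= downtime_s <= interval_s:
        raise ValueError("downtime_s must lie within [0, interval_s]")
    return (interval_s - downtime_s) / interval_s


def max_downtime(interval_s: float, target_efficiency: float) -> float:
    """Largest per-interval downtime that still meets the target efficiency."""
    if not 0 <= target_efficiency <= 1:
        raise ValueError("target_efficiency must lie within [0, 1]")
    return interval_s * (1.0 - target_efficiency)


if __name__ == "__main__":
    interval_s = 10 * 60  # periodic 10-minute rebalancing, as in the abstract
    budget_s = max_downtime(interval_s, 0.99)
    print(f"downtime budget per interval: about {budget_s:.1f} s")
    print(f"efficiency at 6 s downtime: {training_efficiency(interval_s, 6.0):.2%}")
```

Under this assumed definition, 99% efficiency over a 600-second interval leaves a budget of roughly 6 seconds of downtime per rebalancing, which is consistent with the second-level downtime the abstract reports.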