DC · AI · LG · PF · Dec 17, 2024

TrainMover: An Interruption-Resilient and Reliable ML Training Runtime

arXiv:2412.12636v2 · 2 citations · h-index: 9
Originality: Incremental advance
AI Analysis

The work addresses reliability issues for users of large-scale ML training, though it appears incremental because it builds on existing checkpointing and reconfiguration methods.

The paper tackles the problem of frequent interruptions in large-scale ML training jobs by introducing TrainMover, a resilient runtime that uses standby machines to achieve second-level downtime and maintain 99% training efficiency during rebalancing.

Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpointing or runtime reconfiguration suffer from long downtimes, degraded performance, or undesired changes to training strategies. We present TrainMover, a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces two key techniques: two-phase, delta-based communication group setups and communication-free sandboxed shadow iterations. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99% training efficiency during periodic 10-minute rebalancing. We also demonstrate the effectiveness of TrainMover in handling various interruptions.
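
To see how the 99% figure relates to second-level downtime, here is a minimal back-of-the-envelope sketch. It assumes (this definition is not stated in the abstract) that training efficiency is the fraction of wall-clock time spent on useful training, with one migration per rebalancing period. The function names are hypothetical and are not part of TrainMover.

```python
def efficiency(period_s: float, downtime_s: float) -> float:
    """Fraction of wall-clock time spent training.

    Assumption: exactly one interruption of length downtime_s
    occurs in every period of length period_s.
    """
    if period_s <= 0 or not (0 <= downtime_s <= period_s):
        raise ValueError("need period_s > 0 and 0 <= downtime_s <= period_s")
    return (period_s - downtime_s) / period_s


def downtime_budget(period_s: float, target_efficiency: float) -> float:
    """Largest downtime per period that still meets target_efficiency."""
    if not (0 <= target_efficiency <= 1):
        raise ValueError("target_efficiency must be in [0, 1]")
    return period_s * (1.0 - target_efficiency)


if __name__ == "__main__":
    period = 10 * 60  # periodic 10-minute rebalancing, in seconds
    budget = downtime_budget(period, 0.99)
    print(f"downtime budget per period: {budget:.1f} s")
    print(f"efficiency at that downtime: {efficiency(period, budget):.2%}")
```

Under this assumed definition, a 10-minute rebalancing period leaves a budget of roughly 6 seconds of lost time per migration at 99% efficiency, which is consistent with the second-level downtime the abstract reports.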

