DC · AI · May 19, 2025

Learning In Chaos: Efficient Autoscaling and Self-Healing for Multi-Party Distributed Training

arXiv:2505.12815v2
h-index: 17
Originality: Highly original
AI Analysis

The paper addresses slow, centralized autoscaling in self-governed, cross-region training clusters, aiming to improve efficiency for institutions that run distributed AI workloads together.

The paper tackles node and link churn, which disrupts multi-party distributed training over wide-area networks. It proposes Chaos, a self-healing, autoscaling training system that reports lower scale-out delay than existing methods and handles scale-in, connect-link, and disconnect-link events within 20 ms.

Node and link churn in multi-party, cross-region clusters over wide-area networks (WANs) often disrupts distributed training. Checkpoint-based recovery and cloud-centric autoscaling react slowly and assume centralized control, which is misaligned with a self-governed setup in which institutions can freely join and leave. This paper proposes Chaos, a multi-party distributed training system with self-healing and autoscaling, enabling robust and elastic training under churn. It speeds up autoscaling via multi-neighbor state replication and model sharding. We formalize the sharding and assignment as a mixed-integer nonlinear program (MINLP) that captures WAN heterogeneity, and reduce it to a tractable mixed-integer linear program (MILP) by analyzing its monotonicity on a divisibility chain. By establishing an equivalence, we derive a greedy algorithm that follows optimality rules and yields the optimal solution in polynomial time. Chaos uses a cluster monitor to track resource and topology changes, and handles scaling events through peer negotiation protocols, enabling fully self-governed autoscaling among institutions. Experiments show that Chaos has substantially lower scale-out delay than Pollux, Elan, and Autoscaling, and handles scale-in, connect-link, and disconnect-link events within 20 ms. It also delivers the lowest idle time, showing superior resource use and scalability as the cluster grows.
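To give a feel for why replicating state from several neighbors can shorten scale-out, here is a minimal Python sketch. It is not the paper's MINLP formulation or its greedy algorithm: it assumes a continuous (infinitely divisible) model state, fixed per-link bandwidths, and no latency or compute cost. The function names `split_state` and `transfer_time` and the example numbers are hypothetical.

```python
def split_state(total_gb: float, bandwidths: dict[str, float]) -> dict[str, float]:
    """Split the model state across neighbors in proportion to link bandwidth.

    Illustrative assumption: the state is continuously divisible, so a
    proportional split makes every neighbor finish at the same time.
    """
    if not bandwidths or any(bw <= 0 for bw in bandwidths.values()):
        raise ValueError("need at least one neighbor with positive bandwidth")
    total_bw = sum(bandwidths.values())
    return {name: total_gb * bw / total_bw for name, bw in bandwidths.items()}


def transfer_time(shards: dict[str, float], bandwidths: dict[str, float]) -> float:
    """Parallel transfers finish when the slowest shard arrives."""
    return max(size / bandwidths[name] for name, size in shards.items())


# Hypothetical joining node with three neighbors (bandwidths in GB/s).
links = {"A": 1.0, "B": 2.0, "C": 3.0}
state_gb = 12.0

proportional = split_state(state_gb, links)
equal = {name: state_gb / len(links) for name in links}
single = {"C": state_gb}  # pull everything from the fastest neighbor only

t_proportional = transfer_time(proportional, links)
t_equal = transfer_time(equal, links)
t_single = transfer_time(single, links)
```

Under these assumptions the proportional split finishes no later than either the equal split or a single-source copy, because the slowest link no longer holds a disproportionate share. The paper's actual problem is harder: shards are discrete and links are heterogeneous, which is why it is posed as a MINLP.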

