NI · May 12

Avoiding Cross-Datacenter Collective Congestion via Disaggregated Buffering

Mariano Scazzariello, Noga H. Rotman, Dima Gavrilenko, Sajy Khashab, Alexander Shpiner, Matty Kadosh, Marco Chiesa, Dejan Kostic, Mark Silberstein

arXiv:2605.11852 · 8.3

Predicted impact: top 32% in NI · last 90 days

Originality: Incremental advance

AI Analysis

For large-scale LLM training spanning multiple datacenters, Spillway solves a critical but overlooked congestion bottleneck without requiring host or framework changes.

Spillway addresses congestion collapse from cross-datacenter collectives colliding with intra-DC traffic in LLM training, reducing iteration time by up to 14% via in-network disaggregated buffering.

LLM training at the scale of tens of thousands of GPUs now spans multiple datacenters (DCs), making cross-DC collectives over long-haul links unavoidable. A critical and overlooked bottleneck arises when these collectives collide with intra-DC traffic at the destination, a common pattern in real workloads. The multi-millisecond congestion control loop is too slow to react, triggering severe packet loss and congestion collapse. We present Spillway, a transparent in-network mechanism that buffers dropped packets in switch-disaggregated buffers in the destination datacenter and drains them once congestion subsides. Through large-scale end-to-end simulations and a hardware prototype, we show that Spillway eliminates performance degradation from collective collisions, reducing iteration time by up to 14%, without changes to end hosts or training frameworks.
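
The buffer-then-drain idea in the abstract can be illustrated with a toy model. The sketch below is not Spillway's design or code. It is a minimal per-tick simulation of a single congested egress port, and every name and number in it (`simulate`, `link_rate`, `queue_limit`, `spill_limit`, the arrival pattern) is a hypothetical choice made for illustration. It assumes that overflow is either dropped or diverted to a separate spill buffer, and that spilled packets re-enter the port queue as soon as it has room.

```python
def simulate(arrivals, link_rate, queue_limit, spill_limit=0):
    """Toy model of one congested egress port (illustrative only).

    arrivals    -- packets arriving in each tick
    link_rate   -- packets the port can transmit per tick
    queue_limit -- capacity of the on-switch queue, in packets
    spill_limit -- capacity of a hypothetical external spill buffer
                   (0 disables spilling, so overflow is dropped)
    Returns (delivered, dropped).
    """
    queue = spill = delivered = dropped = 0
    tick = 0
    while tick < len(arrivals) or queue or spill:
        # 1. The port transmits up to link_rate packets from its queue.
        sent = min(queue, link_rate)
        queue -= sent
        delivered += sent

        # 2. Spilled packets drain back into the queue when there is room.
        refill = min(spill, queue_limit - queue)
        spill -= refill
        queue += refill

        # 3. New arrivals fill the queue; overflow is spilled or dropped.
        incoming = arrivals[tick] if tick < len(arrivals) else 0
        accepted = min(incoming, queue_limit - queue)
        queue += accepted
        overflow = incoming - accepted
        spilled = min(overflow, spill_limit - spill)
        spill += spilled
        dropped += overflow - spilled

        tick += 1
    return delivered, dropped


# A two-tick burst that exceeds both the link rate and the queue size.
burst = [10, 10, 0, 0, 0, 0]
print(simulate(burst, link_rate=4, queue_limit=6))                   # no spill buffer
print(simulate(burst, link_rate=4, queue_limit=6, spill_limit=100))  # with spill buffer
```

With this burst, the run without a spill buffer drops half of the packets, while the run with one delivers all of them a few ticks later. The model leaves out what makes drops expensive in the cross-DC setting, namely that each lost packet must be recovered over a long-haul link with a multi-millisecond round trip.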
