DC LG · Jul 2, 2025

Distributed Training under Packet Loss

arXiv:2507.07114v1 · 2.31 citations · h-index: 6

Originality: Highly original

AI Analysis

This work addresses a critical gap: enabling robust, high-throughput distributed training of large models on commodity or wide-area networks. It is incremental in that it builds on existing distributed frameworks, but it adds novel guarantees.

The paper tackles distributed training over unreliable connections, where packet loss can degrade accuracy and convergence. It introduces a framework that provides unbiased gradient aggregation and bounded parameter drift, resulting in at most a 0.8% perplexity change on the LLAMA2 7B model with 64 GPUs under 10% packet loss.
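
To give a feel for why parameter drift can stay bounded regardless of the number of iterations, here is a back-of-the-envelope argument. It is not the paper's proof. It assumes (hypothetically) that each parameter-block broadcast is lost independently with probability $p$, that a worker keeps its last received copy of a block, and that one iteration moves a block by at most $\eta G$ in norm, where $\eta$ is a step size and $G$ a gradient-norm bound. The symbols $p$, $\eta$, $G$, and the staleness $s$ (the number of consecutive iterations for which the worker has missed the block) are all introduced here for illustration.

```latex
\Pr[s \ge k] \le p^{k}
\quad\Longrightarrow\quad
\mathbb{E}[s] = \sum_{k \ge 1} \Pr[s \ge k] \le \sum_{k \ge 1} p^{k} = \frac{p}{1-p},
\qquad
\mathbb{E}\,\bigl\|\theta^{\mathrm{worker}}_{t} - \theta_{t}\bigr\|
\le \eta G \,\mathbb{E}[s]
\le \frac{\eta G\, p}{1-p}.
```

Under these assumptions the bound does not depend on the iteration index $t$. At $p = 0.1$ it evaluates to $\eta G / 9$.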

State-of-the-art language and vision models are routinely trained across thousands of GPUs, often spanning multiple data centers, yet today's distributed frameworks still assume reliable connections (e.g., InfiniBand or RoCE). The resulting acknowledgment traffic and retransmissions inflate tail latencies and limit scalability. Leveraging unreliable connections will reduce latency but may sacrifice model accuracy and convergence once packets are dropped. A principled, end-to-end solution that preserves accuracy and convergence guarantees under genuine packet loss has previously been missing. We address this critical gap by introducing a novel distributed training framework capable of operating over unreliable connections, offering unbiased gradient aggregation and bounded parameter drift without modifying model code or optimizers. The key insight is a two-stage defense against missing messages: (i) Unbiased gradient aggregation: each worker reconstructs a consistent gradient estimate from whatever packets arrive, guaranteeing expectation-level correctness; and (ii) Bounded-drift parameter broadcasts: we prove the inter-worker model discrepancy remains O(1) even after arbitrarily many iterations, preventing the unbounded divergence typical of asynchronous setups. Analytical bounds are matched by experiments on the LLAMA2 7B model with 64 GPUs: tolerating 10% random packet loss yields at most 0.8% perplexity change. This work bridges the gap between communication-efficient datacenter protocols and the accuracy and generalization guarantees demanded by modern large-model training, enabling robust, high-throughput learning on commodity or wide-area networks.
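
The abstract does not spell out the aggregation rule, so the following is only a minimal sketch of one way stage (i) could work, not the authors' method. It assumes that packets are dropped independently of their contents and that each gradient chunk is averaged over the workers whose packet for that chunk arrived. The function names `aggregate_with_drops` and `simulate_mean_estimate`, the chunk layout, and the zero fallback when every packet for a chunk is lost are all hypothetical.

```python
import numpy as np


def aggregate_with_drops(worker_grads, received_mask):
    """Average each gradient chunk over the workers whose packet arrived.

    worker_grads:  (num_workers, num_chunks) array; one scalar per chunk for brevity.
    received_mask: boolean array of the same shape; False marks a dropped packet.
    A chunk that no worker delivered falls back to 0 (assumption: skip that update).
    """
    counts = received_mask.sum(axis=0)
    sums = np.where(received_mask, worker_grads, 0.0).sum(axis=0)
    # Divide only where at least one packet arrived; elsewhere keep the 0 fallback.
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)


def simulate_mean_estimate(worker_grads, loss_rate, trials, seed=0):
    """Average the drop-tolerant estimate over many independent loss patterns."""
    rng = np.random.default_rng(seed)
    total = np.zeros(worker_grads.shape[1])
    for _ in range(trials):
        # Assumption: each packet is dropped independently with probability loss_rate.
        received = rng.random(worker_grads.shape) >= loss_rate
        total += aggregate_with_drops(worker_grads, received)
    return total / trials


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    grads = rng.normal(size=(8, 4))  # 8 hypothetical workers, 4 gradient chunks
    print("loss-free mean:     ", grads.mean(axis=0))
    print("mean under 10% loss:", simulate_mean_estimate(grads, 0.10, 20_000))
```

Because drops are independent of the gradient values, the surviving workers for a chunk form a uniformly random subset, so the subset average matches the loss-free average in expectation whenever at least one packet arrives. The zero fallback introduces a small bias that shrinks geometrically with the number of workers.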
