Demystifying Pipeline Parallelism: First Theory for PipeDream
For distributed training practitioners, this work offers a theoretical foundation and scaling diagnosis for pipeline parallelism, though its practical guidance is limited: the experiments are simulated, and neither PipeDream nor LocalSGD wins on every objective.
The paper provides the first theoretical convergence guarantee for PipeDream-style pipeline parallelism via a novel Randomized PipeDream abstraction, and shows that the stale-read term in the convergence bound scales quartically with the number of stages. In simulated experiments, PipeDream outperforms LocalSGD on quadratic objectives and small language modeling, while LocalSGD is better for logistic regression with many stages.
Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018). Our first contribution is theoretical: we introduce Randomized PipeDream (RPD), a stale block-SGD abstraction that yields, to our knowledge, the first clean nonconvex convergence guarantee for a PD-style method. Our second contribution is a scaling diagnosis: we prove that the delay induced by steady-state PD grows as $S^2 - S/2 + O(1)$ for $S$ stages, so the stale-read contribution in the convergence theorem scales as $\Theta(\gamma^2 S^4)$, equivalently as $\Theta(S^4/K)$ in the tuned-rate form. Our third contribution is a comparison with LocalSGD, whose periodic model averaging trades weight staleness for synchronization bubbles. In our reported simulated-time experiments, the better-performing method depends on the objective: PD performs better on the quadratic objective and on a small language-modeling training-loss task, while for logistic regression LocalSGD becomes superior as the number of stages increases.
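The step from the quadratic delay to the quartic term can be sketched as a short calculation. It is an illustration under two assumptions that are not spelled out in the abstract: that the stale-read term is proportional to $\gamma^2 \tau(S)^2$, where $\tau(S)$ is shorthand introduced here for the steady-state delay, and that the tuned step size satisfies $\gamma \propto K^{-1/2}$, with $K$ read as the number of iterations.

$$
\tau(S) = S^2 - \tfrac{S}{2} + O(1)
\quad\Longrightarrow\quad
\tau(S)^2 = S^4 - S^3 + O(S^2) = \Theta(S^4),
$$

$$
\gamma^2\,\tau(S)^2 = \Theta(\gamma^2 S^4),
\qquad
\gamma \propto K^{-1/2}
\quad\Longrightarrow\quad
\gamma^2\,\tau(S)^2 = \Theta\!\left(S^4 / K\right).
$$

Under these assumptions, doubling the number of stages at a fixed step size multiplies the stale-read term by roughly $16$ for large $S$.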