LG · DC · Jul 1, 2025

HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism

arXiv:2507.00394v1 · 1 citation · h-index: 10 · Has Code · PPoPP
Originality: Incremental advance
AI Analysis

This work addresses performance bottlenecks in distributed training of long-sequence transformers, a concern for researchers and practitioners in large-scale AI. It is an incremental advance that builds on existing pipeline parallelism methods.

The paper tackles the inefficiency of existing pipeline parallelism for long-sequence transformer training. It proposes HelixPipe, which combines attention parallel partition with an optimized micro batch schedule to reduce pipeline bubbles and memory overhead. The reported result is a 26% speedup over baseline methods when training a 7B model with a 128k sequence length on 64 H20 GPUs.

As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel pipeline parallelism for long sequence transformer training. First, HelixPipe introduces attention parallel partition, which schedules attention computations of different micro batches across different pipeline stages in parallel, reducing pipeline bubbles. Second, it employs a two-fold first-in-last-out micro batch schedule to balance memory usage and overlap communication with computation. Additionally, HelixPipe utilizes recomputation without attention and chunked MLP to mitigate fragmentation and enable longer sequences. Experiments demonstrate that HelixPipe gains increasing advantages with longer sequence lengths, and outperforms existing methods in throughput and scalability across varying pipeline sizes, model sizes, and cluster configurations. Notably, it achieves a 26% speedup over baseline methods when training a 7B model with 128k sequence length on 64 H20 GPUs. Code is available at https://github.com/code-tunnel/Megatron-LM/tree/dev.
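To make the "quadratic attention computation" in the abstract concrete, the following is a rough back-of-the-envelope sketch, not code from HelixPipe. It uses the common per-layer forward FLOP estimate for a standard transformer block. The assumptions are: an MLP with a 4x hidden expansion, full (non-causal) attention, and no grouped-query or other attention variants. The function name `attention_flop_share` and the hidden size of 4096 are illustrative choices, not values taken from the paper.

```python
def attention_flop_share(seq_len: int, hidden: int) -> float:
    """Estimate the share of per-layer forward matmul FLOPs spent on the
    attention score and value products (the part quadratic in seq_len).

    Assumes a standard transformer block with a 4*hidden MLP and full,
    non-causal attention; per-sequence FLOPs, batch size omitted.
    """
    proj = 8 * seq_len * hidden**2   # Q, K, V and output projections
    attn = 4 * seq_len**2 * hidden   # QK^T and (softmax)V products
    mlp = 16 * seq_len * hidden**2   # two linear layers, 4*hidden wide
    return attn / (proj + attn + mlp)


if __name__ == "__main__":
    hidden = 4096  # assumed hidden size, illustrative only
    for seq_len in (4096, 32768, 131072):
        share = attention_flop_share(seq_len, hidden)
        print(f"seq_len={seq_len:>6}: attention share = {share:.1%}")
```

Under these assumptions the share simplifies to seq_len / (6 * hidden + seq_len), so the attention products grow from a minority of the layer's work at short sequences to the dominant cost at 128k tokens. This is the regime in which distributing attention across pipeline stages, as the abstract describes, is most likely to pay off.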

