DC, AI · Nov 4, 2023

Ultra-Long Sequence Distributed Transformer

Xiao Wang, Isaac Lyngaas, Aristeidis Tsaris, Peng Chen, Sajal Dash, Mayanka Chandra Shekar, Tao Luo, Hong-Jun Yoon, Mohamed Wahib, John Gounley

arXiv:2311.02382v2 · 3.35 citations · h-index: 20

Originality: Highly original

AI Analysis

This work addresses the computational and memory bottlenecks of long-sequence transformer training, offering AI researchers and practitioners a significant improvement over existing methods.

The paper tackles the problem of training transformer models on long sequences by introducing the Long Short-Sequence Transformer (LSS Transformer), a distributed training method that is 5.6x faster and 10.2x more memory-efficient than state-of-the-art sequence parallelism on 144 GPUs. It scales to sequences of 50,112 tokens on 3,456 GPUs with 161% super-linear parallel efficiency.

Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences. Unfortunately, conventional transformers struggle with long-sequence training due to the overwhelming computation and memory requirements. Existing methods for long-sequence training offer limited speedup and memory reduction, and may compromise accuracy. This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer (LSS Transformer), for training transformers with long sequences. It distributes a long sequence into segments among GPUs, with each GPU computing a partial self-attention for its segment. Then, it uses fused communication and a novel double gradient averaging technique to avoid the need to aggregate partial self-attention and to minimize communication overhead. We compared the performance of the LSS Transformer and the state-of-the-art Nvidia sequence parallelism on the Wikipedia enwik8 dataset. Results show that our proposed method leads to a 5.6x faster and 10.2x more memory-efficient implementation compared to state-of-the-art sequence parallelism on 144 Nvidia V100 GPUs. Moreover, our algorithm scales to an extreme sequence length of 50,112 at 3,456 GPUs, achieving 161% super-linear parallel efficiency and a throughput of 32 petaflops.
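
The segment-wise computation described in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation. It assumes that each worker attends from the queries of its own segment to the keys and values of the whole sequence, and it does not model GPUs, the fused communication, or the double gradient averaging. The function names (`attention`, `split_segments`), the toy sizes, and the use of identity projections (Q = K = V = x) are all assumptions made for illustration.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for a block of query rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def split_segments(rows, num_workers):
    """Split a sequence into contiguous, equal-length segments (hypothetical helper)."""
    seg_len = len(rows) // num_workers
    return [rows[i * seg_len:(i + 1) * seg_len] for i in range(num_workers)]

random.seed(0)
seq_len, dim, num_workers = 8, 4, 4  # toy sizes, chosen only for illustration
x = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]

# Reference: attention over the whole sequence at once (assumed Q = K = V = x).
full = attention(x, x, x)

# Each simulated worker computes the output rows for its own query segment only.
partial = [attention(seg, x, x) for seg in split_segments(x, num_workers)]
stitched = [row for block in partial for row in block]

# Largest element-wise difference between the full and segment-wise results.
max_diff = max(abs(a - b) for r1, r2 in zip(full, stitched) for a, b in zip(r1, r2))
print("segments:", len(partial), "max difference:", max_diff)
```

In this toy setting each worker produces seq_len / num_workers output rows, so the per-worker attention score matrix shrinks from seq_len x seq_len to (seq_len / num_workers) x seq_len, and the rows produced by the workers agree with the full computation.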
