USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
This work addresses the problem of efficiently training large generative models on long sequences, a concern for AI researchers and practitioners, and represents an incremental improvement over existing methods.
The paper tackles the challenge of enabling long-context capabilities in generative AI models by proposing a unified sequence parallelism approach that is more robust to differences in transformer model architecture and network hardware topology, achieving 47% Model FLOPs Utilization (MFU) when training LLAMA3-8B at a sequence length of 208K.
Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates two state-of-the-art SP approaches, DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach that is more robust to transformer model architectures and network hardware topology. It also compares the communication and memory costs of SP with those of existing parallelism methods, including data, tensor, ZeRO, and pipeline parallelism, and discusses best practices for designing hybrid 4D parallelism involving SP. Using SP, we achieved 47% MFU on two 8xA800 nodes when training the LLAMA3-8B model with a sequence length of 208K. Our code is publicly available at https://github.com/feifeibear/long-context-attention.
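To make the idea of dividing the sequence dimension concrete, the following is a minimal single-process sketch in plain Python. It is not taken from the linked repository: the function names `shard_sequence` and `all_to_all_seq_to_head` are hypothetical, devices are simulated as list entries, and no real communication takes place. The second function illustrates, under those assumptions, the layout change commonly associated with DeepSpeed-Ulysses, where an all-to-all turns sequence-sharded activations (each device holds some tokens, all heads) into head-sharded activations (each device holds all tokens, some heads).

```python
def shard_sequence(seq, num_devices):
    """Split a sequence (a list of per-token items) into contiguous, equal shards.

    Hypothetical helper: shard r stands in for the slice held by device r.
    """
    if len(seq) % num_devices != 0:
        raise ValueError("sequence length must be divisible by num_devices")
    chunk = len(seq) // num_devices
    return [seq[r * chunk:(r + 1) * chunk] for r in range(num_devices)]


def all_to_all_seq_to_head(shards):
    """Simulate an all-to-all that converts sequence sharding to head sharding.

    Input:  shards[r][t][h]  -> device r holds a slice of tokens, all heads.
    Output: out[r][t][h_loc] -> device r holds all tokens, a slice of heads.
    This only rearranges Python lists; it is an illustration, not a real collective.
    """
    num_devices = len(shards)
    num_heads = len(shards[0][0])
    if num_heads % num_devices != 0:
        raise ValueError("head count must be divisible by num_devices")
    heads_per_device = num_heads // num_devices

    out = []
    for r in range(num_devices):
        lo, hi = r * heads_per_device, (r + 1) * heads_per_device
        full_sequence = []
        # Visiting source shards in rank order preserves the original token order.
        for src in shards:
            for token in src:
                full_sequence.append(token[lo:hi])
        out.append(full_sequence)
    return out


# Toy input: 8 tokens, 4 heads; each entry is labelled (token_index, head_index).
tokens = [[(t, h) for h in range(4)] for t in range(8)]

seq_sharded = shard_sequence(tokens, num_devices=2)   # 2 devices x 4 tokens x 4 heads
head_sharded = all_to_all_seq_to_head(seq_sharded)    # 2 devices x 8 tokens x 2 heads
```

In this toy layout, each simulated device initially holds half of the tokens with every head, and after the rearrangement holds every token with half of the heads. The constraint that the head count be divisible by the number of devices is visible directly in `all_to_all_seq_to_head`.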