DC LG

Mar 14, 2024

BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

arXiv:2403.09347v4

11 citations

Originality: Incremental advance

AI Analysis

This work addresses the quadratic time and memory complexities in attention modules for long sequences, offering a solution for researchers and practitioners in natural language processing and AI, though it is incremental as it builds on existing distributed approaches.

The paper tackles the challenge of processing extremely long sequences in Transformer-based large language models by proposing BurstAttention, a distributed attention framework that reduces communication overheads by 40% and achieves a 1.37x speedup during training on 128K sequence lengths with 32 A100 GPUs.

Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution for the long sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overheads to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named "BurstAttention" to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long sequence processing. The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, reducing communication overheads by 40% and achieving a 1.37x speedup when training with a 128K sequence length on 32 A100 GPUs.
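To make "aggregating local results into global ones" concrete, here is a minimal single-process sketch in plain Python. It is not the BurstAttention implementation. It assumes the commonly used log-sum-exp rescaling trick for merging partial softmax results, and the function names (`local_attention`, `merge`, `sharded_attention`) are hypothetical. Each shard of keys and values stands in for one device, and no real communication takes place.

```python
import math

def local_attention(q, keys, values):
    """Attention of one query over one shard of keys/values.

    Returns the unnormalized output together with the shard's maximum
    score and its sum of exponentials, which are needed for merging.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, m, denom

def merge(partials):
    """Combine per-shard partial results into the global attention output."""
    m_global = max(m for _, m, _ in partials)
    dim = len(partials[0][0])
    out = [0.0] * dim
    denom = 0.0
    for o, m, d in partials:
        c = math.exp(m - m_global)  # rescale the shard to the global maximum
        denom += c * d
        for j in range(dim):
            out[j] += c * o[j]
    return [x / denom for x in out]

def sharded_attention(q, keys, values, num_shards):
    """Split keys/values into contiguous shards (one per simulated device)."""
    size = math.ceil(len(keys) / num_shards)
    partials = [
        local_attention(q, keys[i:i + size], values[i:i + size])
        for i in range(0, len(keys), size)
    ]
    return merge(partials)

q = [0.5, -1.0]
keys = [[1, 0], [0, 1], [1, 1], [-1, 2], [0.3, 0.7], [2, -1], [0, 0], [1, -1]]
values = [[float(i), 2.0 * i] for i in range(len(keys))]
full = sharded_attention(q, keys, values, 1)   # one shard: ordinary attention
split = sharded_attention(q, keys, values, 4)  # four shards, then merged
```

In this sketch each shard only has to pass on its output vector, one maximum, and one sum, so the full row of attention scores never needs to be gathered in one place. In a real cluster those partial results are what would travel between devices, which is the communication cost the abstract refers to.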
