Accelerating Sparse Transformer Inference on GPU
For practitioners deploying large language models, STOF provides a practical speedup for sparse Transformer inference on GPU, though the gains are incremental.
STOF accelerates sparse Transformer inference on GPU by optimizing multi-head attention with row-wise/blockwise kernels and enabling adaptive operator fusion, achieving up to 1.6x speedup in MHA and 1.4x in end-to-end inference over prior work.
Large language models (LLMs) are widely used because of their strong language understanding capabilities. The Transformer is the core component of LLMs, and accelerating it through parallelization has become an active research topic. Mask layers introduce sparsity into the Transformer and reduce the amount of computation. However, previous work rarely focuses on performance optimization for sparse Transformers. In addition, current static operator fusion schemes fail to adapt to diverse application scenarios. To address these problems, we propose STOF, a framework that incorporates optimizations for Sparse Transformer inference through flexible masking and Operator Fusion on GPU. For the multi-head attention (MHA) structure, STOF maps the computation to row-wise or blockwise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through a two-stage search. Experimental results show that, compared to state-of-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.
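To make the idea of exploiting mask sparsity at block granularity concrete, the following is a minimal pure-Python sketch, not STOF's GPU kernels or storage formats. It labels each tile of an attention mask as empty, full, or partial, then skips empty tiles during masked attention and checks the result against a dense reference. All function names (`classify_blocks`, `dense_masked_attention`, `block_skipping_attention`), the block size, and the toy inputs are assumptions made for illustration; the sketch also assumes every query row has at least one unmasked key.

```python
import math

def classify_blocks(mask, block):
    """Label each (block x block) tile of a square boolean mask."""
    n = len(mask)
    labels = {}
    for bi in range(0, n, block):
        for bj in range(0, n, block):
            tile = [mask[i][j]
                    for i in range(bi, min(bi + block, n))
                    for j in range(bj, min(bj + block, n))]
            if not any(tile):
                kind = "empty"      # nothing to compute in this tile
            elif all(tile):
                kind = "full"       # no per-element mask check needed
            else:
                kind = "partial"    # per-element mask check required
            labels[(bi // block, bj // block)] = kind
    return labels

def _softmax_weighted_sum(scores, cols, v):
    """Numerically stable softmax over scores, applied to rows of v."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d = len(v[0])
    return [sum(w * v[j][t] for w, j in zip(weights, cols)) / total
            for t in range(d)]

def dense_masked_attention(q, k, v, mask):
    """Reference: visits every (i, j) pair and tests the mask each time."""
    n, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(n):
        cols = [j for j in range(n) if mask[i][j]]
        scores = [scale * sum(q[i][t] * k[j][t] for t in range(d)) for j in cols]
        out.append(_softmax_weighted_sum(scores, cols, v))
    return out

def block_skipping_attention(q, k, v, mask, block):
    """Same result as the reference, but empty tiles are never visited."""
    n, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    labels = classify_blocks(mask, block)
    out = []
    for i in range(n):
        cols = []
        for bj in range(0, n, block):
            kind = labels[(i // block, bj // block)]
            if kind == "empty":
                continue  # skip the whole tile
            for j in range(bj, min(bj + block, n)):
                if kind == "full" or mask[i][j]:
                    cols.append(j)
        scores = [scale * sum(q[i][t] * k[j][t] for t in range(d)) for j in cols]
        out.append(_softmax_weighted_sum(scores, cols, v))
    return out

# Toy example: a 4x4 causal mask split into 2x2 tiles.
n = 4
causal = [[j <= i for j in range(n)] for i in range(n)]
q = [[0.1 * (i + t) for t in range(2)] for i in range(n)]
k = [[0.2 * (i - t) for t in range(2)] for i in range(n)]
v = [[float(i), float(i * i)] for i in range(n)]

labels = classify_blocks(causal, 2)
ref = dense_masked_attention(q, k, v, causal)
fast = block_skipping_attention(q, k, v, causal, 2)
```

With the causal mask, the upper-right tile is entirely masked and is skipped, the lower-left tile is fully unmasked, and the two diagonal tiles are partial. A real GPU implementation would map tiles to thread blocks and choose between row-wise and blockwise layouts based on a cost model; this sketch only shows the skipping logic.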