LG AI CLOct 14, 2024

HSR-Enhanced Sparse Attention Acceleration

Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song

arXiv:2410.10165v221.124 citationsh-index: 21CPAL

Originality Incremental advance

AI Analysis

This work addresses the problem of inefficient long-context processing in LLMs, which is a bottleneck for applications requiring extended sequences, and represents an incremental improvement by optimizing existing attention mechanisms rather than introducing a new paradigm.

The paper tackles the computational complexity of attention mechanisms in Large Language Models (LLMs) for long-context tasks by introducing a novel approach using Half-Space Reporting (HSR) to accelerate attention computation, achieving running times of O(mn^{4/5}) for generation decoding and O(mn^{1 - 1 / ⌊d/2⌋} + mn^{4/5}) for prompt prefilling, with provably negligible error for Softmax attention.

Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. We introduce a novel approach to accelerate attention computation in LLMs, particularly for long-context scenarios. We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention (with $\mathsf{ReLU}^α$ activation, $α\in \mathbb{N}_+$), to significantly reduce the running time complexity. Our method employs a Half-Space Reporting (HSR) data structure to identify non-zero or ``massively activated'' entries in the attention matrix. We present theoretical analyses for two key scenarios: generation decoding and prompt prefilling. Our approach achieves a running time of $O(mn^{4/5})$ significantly faster than the naive approach $O(mn)$ for generation decoding, where $n$ is the context length, $m$ is the query length, and $d$ is the hidden dimension. We can also reduce the running time for prompt prefilling from $O(mn)$ to $O(mn^{1 - 1 / \lfloor d/2\rfloor} + mn^{4/5})$. Our method introduces only provably negligible error for Softmax attention. This work represents a significant step towards enabling efficient long-context processing in LLMs.

View on arXiv PDF

Similar