Attention Condensation via Sparsity Induced Regularized Training
The work addresses an efficiency bottleneck in deploying Large Language Models, but its contribution is incremental: it builds on existing sparsity frameworks and has so far been tested only on smaller models.
The paper tackles the problem of self-attention dominating transformer inference time as context windows expand. It proposes a sparsity-inducing regularized training method intended to accelerate attention computation with minimal performance degradation. Initial evaluations with GPT-2 show that the attention matrices of the trained models become sparse while still capturing relevant dependencies.
As the context window expands, self-attention increasingly dominates the transformer's inference time. Therefore, accelerating attention computation while minimizing performance degradation is essential for the efficient deployment of Large Language Models (LLMs). In this study, we extend a theoretical framework of attention sparsity in LLMs. We design a customized loss function that enforces sparsity by restricting the number of top elements in the attention matrix. We perform an initial set of evaluations with GPT-2 to show the effectiveness of our sparsification approach. The attention matrices of the models trained with the proposed loss are both sparse and effective in capturing relevant input dependencies. We are continuing this work to demonstrate the value of our approach on larger models and different architectures.
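The abstract does not give the form of the customized loss, so the following is only a minimal sketch of one way a "top elements" sparsity penalty could be written, not the authors' formulation. It assumes the penalty is the attention probability mass that falls outside the k largest entries of each row, added to the task loss with a weight. The names `topk_mass_penalty`, `sparsity_regularized_loss`, `k`, and `lam` are hypothetical and chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topk_mass_penalty(attn_row, k):
    """Probability mass outside the k largest entries of one attention row.

    Assumed penalty form (not taken from the paper): 0 when all mass sits
    on at most k keys, approaching 1 as attention spreads out.
    """
    top = sorted(attn_row, reverse=True)[:k]
    return 1.0 - sum(top)

def sparsity_regularized_loss(task_loss, attn_rows, k, lam):
    """Hypothetical combined loss: task loss + lam * mean out-of-top-k mass."""
    penalty = sum(topk_mass_penalty(row, k) for row in attn_rows) / len(attn_rows)
    return task_loss + lam * penalty

# One peaked and one uniform attention row over four keys.
peaked = softmax([8.0, 0.0, 0.0, 0.0])
uniform = softmax([0.0, 0.0, 0.0, 0.0])

# The uniform row is penalized more heavily than the peaked one.
print(topk_mass_penalty(peaked, 1) < topk_mass_penalty(uniform, 1))
print(sparsity_regularized_loss(2.0, [peaked, uniform], k=1, lam=0.1))
```

In an actual training setup the same quantity would be computed on attention tensors inside an automatic-differentiation framework so that gradients flow through the top-k values; the pure-Python version here only illustrates what is being penalized.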