LGCLNov 19, 2024

Selective Attention: Enhancing Transformer through Principled Context Control

arXiv:2411.12892v121 citationsh-index: 39NIPS
Originality Incremental advance
AI Analysis

This work addresses attention dilution and optimization issues in transformers, offering a lightweight method for enhancing existing LLMs, though it is incremental as it builds on established attention mechanisms.

The paper tackles the problem of uniform query treatment in transformer self-attention, which hinders contextual sparsity and relevance control, by introducing Selective Self-Attention (SSA) with temperature scaling, resulting in consistent accuracy improvements on language modeling benchmarks.

The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same way by applying the mapping $V^\top\text{softmax}(Kq)$, where $V,K$ are the value and key embeddings respectively. In this work, we argue that this uniform treatment hinders the ability to control contextual sparsity and relevance. As a solution, we introduce the $\textit{Selective Self-Attention}$ (SSA) layer that augments the softmax nonlinearity with a principled temperature scaling strategy. By controlling temperature, SSA adapts the contextual sparsity of the attention map to the query embedding and its position in the context window. Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization process, and enhances the model's ability to control softmax spikiness of individual queries. We also incorporate temperature scaling for value embeddings and show that it boosts the model's ability to suppress irrelevant/noisy tokens. Notably, SSA is a lightweight method which introduces less than 0.5% new parameters through a weight-sharing strategy and can be fine-tuned on existing LLMs. Extensive empirical evaluations demonstrate that SSA-equipped models achieve a noticeable and consistent accuracy improvement on language modeling benchmarks.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes