Mitigating Attention Localization in Small Scale: Self-Attention Refinement via One-step Belief Propagation
The paper addresses attention localization in transformers, particularly in resource-constrained scenarios, but the contribution appears incremental because it refines existing self-attention mechanisms.
The paper tackles attention localization in transformers, where attention collapses onto a limited subset of tokens and fails to capture long-range dependencies. It proposes Self-Attention One-step Belief Propagation (SAOBP), a refinement framework based on belief propagation. The results indicate that SAOBP helps prevent entropy collapse in deeper layers, keeps Global Token Dependency (GTD) at task-appropriate levels, and yields competitive gains in small-scale models.
The Transformer-based self-attention mechanism serves as the core of modern language models, yet it often suffers from localization, where attention collapses onto a limited subset of tokens and fails to capture long-range dependencies. To address this issue, we propose Self-Attention One-step Belief Propagation (SAOBP), a refinement framework that injects multi-hop relationships through a belief propagation process. To interpret and quantify these interactions, we introduce Global Token Dependency (GTD), which captures the relative contribution of multi-hop connections within the attention graph. Empirical results indicate that SAOBP helps prevent entropy collapse in deeper layers and adaptively maintains GTD at task-appropriate levels, thereby supporting improvements in model performance. Importantly, we observe competitive gains in small-scale models, highlighting SAOBP's potential for improving inference quality in resource-constrained scenarios.
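To make the notions of localization and multi-hop injection concrete, the following is a minimal NumPy sketch. It is not the paper's SAOBP update or its GTD definition, neither of which is specified in the abstract. It assumes a hypothetical refinement that mixes the direct attention matrix with its two-hop product, and it uses row-wise Shannon entropy as a stand-in for the entropy whose collapse the abstract refers to. The function names `row_entropy` and `one_step_refine` and the mixing weight `lam` are illustrative assumptions.

```python
import numpy as np


def row_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of each row of a row-stochastic attention matrix."""
    return -(attn * np.log(attn + eps)).sum(axis=-1)


def one_step_refine(attn, lam=0.5):
    """Hypothetical one-step refinement (an assumption, not the paper's update).

    Mixes direct attention with two-hop attention (attn @ attn), so token i
    also receives mass from tokens that its neighbours attend to.
    """
    two_hop = attn @ attn                      # i -> k -> j paths
    mixed = (1.0 - lam) * attn + lam * two_hop
    # Renormalize rows to guard against floating-point drift.
    return mixed / mixed.sum(axis=-1, keepdims=True)


# A sharply localized 4-token attention map: each token puts 0.97 of its
# mass on the next token and 0.01 on each of the others.
n = 4
attn = 0.01 * np.ones((n, n)) + 0.96 * np.roll(np.eye(n), 1, axis=1)

refined = one_step_refine(attn, lam=0.5)

print("mean row entropy before:", row_entropy(attn).mean())
print("mean row entropy after: ", row_entropy(refined).mean())
```

In this toy case the refined rows spread their mass over both the one-hop and the two-hop targets, so the mean row entropy rises. That is the qualitative effect the abstract attributes to injecting multi-hop relationships; whether such a mixture helps in a trained model is an empirical question this sketch does not address.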