CL · Aug 25, 2025

Integral Transformer: Denoising Attention, Not Too Much Not Too Little

arXiv:2508.18387v1 · 3 citations · h-index: 13 · EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses a specific issue in Transformer language models, attention noise, and offers an incremental improvement over existing attention mechanisms.

The paper tackles attention noise in softmax self-attention, where uninformative tokens receive disproportionate weight. It proposes the Integral Transformer, which denoises attention by integrating signals sampled from the logit distribution, and reports that it outperforms vanilla, Cog, and Differential attention variants on knowledge and reasoning benchmarks.

Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.
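
To make the contrast in the abstract concrete, here is a minimal Python sketch of attention weights for a single query under three schemes. The vanilla and differential forms follow their standard definitions (softmax of scaled dot products, and a difference of two softmax maps scaled by a factor `lam`). The abstract does not give the Integral Transformer's formula, so `integral_attention_sketch` is an assumption: it treats "integrating signals sampled from the logit distribution" as averaging several logit vectors before a single softmax. All function names, dimensions, and the random inputs are hypothetical and chosen only for illustration.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def logits(q, keys):
    """Scaled dot-product logits of one query against all keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def vanilla_attention(q, keys):
    """Standard softmax attention weights (non-negative, sum to 1)."""
    return softmax(logits(q, keys))

def differential_attention(q1, q2, keys1, keys2, lam=0.5):
    """Difference of two softmax maps; individual weights can be negative."""
    a1 = softmax(logits(q1, keys1))
    a2 = softmax(logits(q2, keys2))
    return [x - lam * y for x, y in zip(a1, a2)]

def integral_attention_sketch(queries, key_sets):
    """ASSUMPTION (not from the abstract): average the logit vectors of
    several 'signals', then apply one softmax to the averaged logits."""
    signals = [logits(q, ks) for q, ks in zip(queries, key_sets)]
    n_tokens = len(signals[0])
    mean_logits = [
        sum(s[j] for s in signals) / len(signals) for j in range(n_tokens)
    ]
    return softmax(mean_logits)

# Hypothetical toy inputs: 4 signals, 5 tokens, head dimension 8.
random.seed(0)
d, n_tokens, n_signals = 8, 5, 4

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(d)]

queries = [rand_vec() for _ in range(n_signals)]
key_sets = [[rand_vec() for _ in range(n_tokens)] for _ in range(n_signals)]

vanilla = vanilla_attention(queries[0], key_sets[0])
diff = differential_attention(queries[0], queries[1], key_sets[0], key_sets[1])
weights = integral_attention_sketch(queries, key_sets)
```

Under this assumed form, the averaged-logit weights stay non-negative and sum to 1, so no token's contribution is subtracted away, whereas the differential weights sum to `1 - lam` and may go negative. That matches the abstract's qualitative claim about preserving special-token contributions, but the paper itself should be consulted for the actual formulation.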

