CV · AI · Jan 25, 2025

PolaFormer: Polarity-aware Linear Attention for Vision Transformers

arXiv:2501.15061v2 · 70 citations · h-index: 2 · ICLR
Originality: Incremental advance
AI Analysis

This work targets a specific bottleneck in vision transformers, the information lost when softmax attention is replaced by linear attention, and offers an incremental improvement in efficiency and expressiveness.

The paper tackles information loss in linear attention for vision transformers, which produces less discriminative attention maps. It proposes a polarity-aware linear attention that models both same-signed and opposite-signed query-key interactions and uses a learnable power function to reduce entropy, reporting performance improvements of up to 4.6% on various vision tasks.

Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from quadratic to linear in sequence length. However, the non-negative constraint on feature maps and the relaxed exponential function used in approximation lead to significant information loss compared to the original query-key dot products, resulting in less discriminative attention maps with higher entropy. To address the missing interactions driven by negative values in query-key pairs, we propose a polarity-aware linear attention mechanism that explicitly models both same-signed and opposite-signed query-key interactions, ensuring comprehensive coverage of relational information. Furthermore, to restore the spiky properties of attention maps, we provide a theoretical analysis proving the existence of a class of element-wise functions (with positive first and second derivatives) that can reduce entropy in the attention distribution. For simplicity, and recognizing the distinct contributions of each dimension, we employ a learnable power function for rescaling, allowing strong and weak attention signals to be effectively separated. Extensive experiments demonstrate that the proposed PolaFormer improves performance on various vision tasks, enhancing both expressiveness and efficiency by up to 4.6%.
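
The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`split_polarity`, `linear_attention`, `polarity_aware_attention`, `entropy`) are hypothetical, the fixed `power` argument stands in for the learnable exponent, and the choice to let each stream attend over one half of the value channels is an assumption made here for illustration. The sketch splits queries and keys into non-negative positive and negative parts, so that the same-signed pairs minus the opposite-signed pairs recover the original dot product, and then runs two linear-attention streams without ever forming the N x N attention map.

```python
import numpy as np


def split_polarity(x):
    """Return non-negative parts (x_pos, x_neg) such that x = x_pos - x_neg."""
    return np.maximum(x, 0.0), np.maximum(-x, 0.0)


def linear_attention(phi_q, phi_k, v, eps=1e-6):
    """Kernelized attention in O(N * d * d_v), never forming the N x N map.

    phi_q, phi_k: (N, d) non-negative feature maps; v: (N, d_v).
    """
    kv = phi_k.T @ v                    # (d, d_v) summary of keys and values
    norm = phi_q @ phi_k.sum(axis=0)    # (N,) row sums of the implicit attention map
    return (phi_q @ kv) / (norm[:, None] + eps)


def polarity_aware_attention(q, k, v, power=3.0):
    """Two linear-attention streams: same-signed and opposite-signed q-k pairs.

    Assumption: each stream attends over one half of the value channels and the
    halves are concatenated; `power` is a fixed stand-in for a learnable exponent.
    """
    q_pos, q_neg = split_polarity(q)
    k_pos, k_neg = split_polarity(k)
    # Element-wise power: increasing and convex on [0, inf) when power > 1.
    q_feat = np.concatenate([q_pos, q_neg], axis=-1) ** power
    k_same = np.concatenate([k_pos, k_neg], axis=-1) ** power  # q+ with k+, q- with k-
    k_opp = np.concatenate([k_neg, k_pos], axis=-1) ** power   # q+ with k-, q- with k+
    v_same, v_opp = np.split(v, 2, axis=-1)                    # needs an even d_v
    out_same = linear_attention(q_feat, k_same, v_same)
    out_opp = linear_attention(q_feat, k_opp, v_opp)
    return np.concatenate([out_same, out_opp], axis=-1)


def entropy(scores):
    """Shannon entropy of a non-negative score vector after normalization."""
    probs = scores / scores.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())


# Toy usage with random inputs: 16 tokens, 8 channels.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out = polarity_aware_attention(q, k, v)  # shape (16, 8)
```

The `entropy` helper gives a toy view of the rescaling step: raising a non-negative score vector to a power greater than 1 and renormalizing concentrates mass on the largest entries, which lowers the entropy. The paper's theoretical result concerns element-wise functions applied to the features rather than to the scores, so this helper is only an intuition aid.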

