CV | AI | Jun 10, 2025

SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging

arXiv:2506.08297v1 | 1 citation | h-index: 19
Originality: Incremental advance
AI Analysis

The paper addresses the problem of scaling attention for computer vision tasks and offers a more efficient alternative to existing methods, though the advance appears incremental because it builds on recent Mamba developments.

The paper tackles two weaknesses of attention in computer vision: the quadratic cost of full softmax attention and the inability of linear attention to focus on relevant tokens. It proposes SEMA, a scalable and efficient Mamba-like attention method that combines token localization with averaging. On ImageNet-1k classification, SEMA outperforms recent vision Mamba models at larger image scales with similar parameter counts.

Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size and the inability of its linear attention variant to focus have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within this general framework. We prove that generalized attention disperses: as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and the recent development of the Mamba form of attention, we design Scalable and Efficient Mamba like Attention (SEMA), which uses token localization to avoid dispersion and maintain focus, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. We support our approach with experiments on ImageNet-1k, where classification results show that SEMA is a scalable and effective alternative beyond linear attention, outperforming recent vision Mamba models on increasingly larger image scales at similar model parameter sizes.
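The dispersion property described in the abstract can be illustrated with a small numerical sketch. The code below is not the authors' SEMA implementation; it is a toy example under stated assumptions: scores are scalars drawn uniformly from a bounded interval, values are scalars, and the local and global terms are simply summed. The function names (`max_attention_weight`, `local_plus_average`) are hypothetical. With scores bounded in [-B, B], every softmax weight is at most e^(2B)/n, so the largest weight shrinks as the number of keys n grows. Restricting attention to a fixed window keeps the largest weight at or above 1/(window size), and the arithmetic mean of all values supplies a global term.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_attention_weight(num_keys, bound=1.0, seed=0):
    """Largest softmax weight one query assigns to num_keys keys.

    Toy assumption: query-key scores are i.i.d. uniform in [-bound, bound].
    """
    rng = random.Random(seed)
    scores = [rng.uniform(-bound, bound) for _ in range(num_keys)]
    return max(softmax(scores))


def local_plus_average(scores, values, center, window):
    """Windowed softmax attention around `center` plus a global mean.

    Summing the two terms is an assumption made for illustration only;
    the paper may combine them differently.
    Returns (output, largest local attention weight).
    """
    lo = max(0, center - window)
    hi = min(len(scores), center + window + 1)
    weights = softmax(scores[lo:hi])
    local = sum(w * v for w, v in zip(weights, values[lo:hi]))
    global_mean = sum(values) / len(values)
    return local + global_mean, max(weights)


if __name__ == "__main__":
    # Full attention: the largest weight decays roughly like 1/n.
    for n in (10, 100, 1000, 10000):
        print(n, max_attention_weight(n))

    # Localized attention: the largest weight stays at or above 1/7
    # for a window of 7 keys, however many keys exist in total.
    rng = random.Random(1)
    n = 10000
    scores = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    values = [rng.uniform(0.0, 1.0) for _ in range(n)]
    print(local_plus_average(scores, values, center=n // 2, window=3))
```

In this sketch the window size is fixed, so the cost per query does not grow with the number of keys, while the global mean can be computed once and shared across all queries.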

