LG Jun 26, 2025

NaLaFormer: Norm-Aware Linear Attention for Transformer Models

Weikang Meng, Yadan Luo, Liangyu Huo, Yaowei Wang, Xin Li, Zheng Zhang

arXiv:2506.21137v19.42 citations, h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck of linear attention in transformer models, the loss of query-norm information under normalization, and offers incremental improvements in efficiency and expressiveness.

The paper tackles the entropy gap and the missing inner-product interactions in linear attention by proposing NaLaFormer, a norm-aware linear attention mechanism that restores dynamic spikiness and norm consistency, yielding performance improvements of up to 4.2% on vision and language tasks.

Linear attention has emerged as a viable alternative to softmax attention by reducing complexity from quadratic to linear in sequence length. To preserve two fundamental properties of softmax, non-negativity and entropy reduction, current works employ various linearly separable kernel functions with $L_1$ normalization instead of the softmax operator. However, the normalization operation in linear attention neglects query norms, and this degradation leads to an entropy gap. Meanwhile, existing works suppress negative values of the query and key vectors, so inner-product interactions are lost after the mapping. To address these dual challenges, we propose a novel Norm-Aware Linear Attention mechanism that restores norm-guided dynamic spikiness and recovers kernel-perturbed norm distributions. Specifically, we first decouple the query and key matrices into two components, norm and direction, to achieve norm-aware spikiness control and norm consistency, respectively. We show mathematically that the extent of entropy reduction varies with the query norm in softmax normalization, motivating a query-norm-aware kernel function for dynamic control over entropy reduction. Furthermore, to ensure norm consistency and enforce non-negativity constraints, we employ a norm-preserving mapping to project all elements of the angular matrix into positive values, leveraging cosine similarity to inhibit dimensions with opposite directions. We conduct extensive experiments demonstrating that NaLaFormer improves performance on vision and language tasks, enhancing both expressiveness and efficiency by up to 4.2%.
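
The entropy gap described in the abstract can be checked numerically with a small sketch. This is not the NaLaFormer kernel: it assumes a plain ReLU feature map (chosen here only because it is positively homogeneous), and the helper names, the toy query, and the toy keys are all hypothetical. Under those assumptions, scaling the query norm sharpens the softmax distribution, while the $L_1$-normalized linear attention weights stay the same because the scale factor cancels in the ratio.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relu(v):
    # Illustrative non-negative feature map (an assumption, not the paper's kernel).
    return [max(x, 0.0) for x in v]

def scale(v, c):
    return [c * x for x in v]

def softmax_weights(q, keys):
    """Softmax attention weights of one query over all keys."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear_weights(q, keys):
    """Linear attention weights: kernel similarities with L1 normalization."""
    fq = relu(q)
    sims = [dot(fq, relu(k)) for k in keys]
    total = sum(sims)  # assumed positive for the toy data below
    return [s / total for s in sims]

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

# Hypothetical toy data: one query and four keys in three dimensions.
q = [1.0, 0.5, -0.3]
keys = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, -0.5],
    [0.4, 0.4, 0.4],
    [-0.1, 0.3, 0.7],
]

# Same query direction, growing query norm.
for c in (1.0, 2.0, 4.0):
    qs = scale(q, c)
    h_soft = entropy(softmax_weights(qs, keys))
    h_lin = entropy(linear_weights(qs, keys))
    print(f"scale={c}: softmax entropy={h_soft:.4f}, linear entropy={h_lin:.4f}")
```

In this toy setting the softmax entropy falls as the scale grows, while the linear attention entropy does not change. That difference is the gap a query-norm-aware kernel is meant to close.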
