CL AI Sep 16, 2025

Positional Encoding via Token-Aware Phase Attention

Yu Wang, Sheng Shen, Rémi Munos, Hongyuan Zhan, Yuandong Tian

arXiv:2509.12635v2, 3 citations, h-index: 4

Originality: Incremental advance

AI Analysis

The paper addresses a key limitation of transformer architectures in long-context applications, though it appears to be an incremental improvement over existing RoPE methods.

The paper argues that Rotary Positional Embedding (RoPE) introduces a distance-dependent bias that limits long-context modeling, and proposes Token-Aware Phase Attention (TAPA), a positional encoding method that achieves significantly lower perplexity than RoPE-family methods on long-context tasks.

We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long ranges, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long contexts than the RoPE family of methods.
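To make the distance dependence concrete, here is a minimal sketch of standard RoPE in plain Python. It is not the paper's proof and it does not implement TAPA, whose phase function is not specified in this summary. The helper names `rope` and `score`, the dimension of 64, and the all-ones query and key are assumptions chosen for illustration only.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard RoPE)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Pair i//2 uses frequency base^(-i/d), i.e. base^(-2j/d) for j = i//2.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, m, n):
    """Unnormalised attention logit between a query at position m and a key at position n."""
    qm, kn = rope(q, m), rope(k, n)
    return sum(a * b for a, b in zip(qm, kn))

# Hypothetical content vectors: identical query and key, so only position varies.
d = 64
q = [1.0] * d
k = [1.0] * d

# The logit depends only on the offset m - n, not on absolute positions.
same_offset = (score(q, k, 5, 0), score(q, k, 105, 100))

# With the content held fixed, the logit still changes with distance alone.
by_distance = {dist: score(q, k, dist, 0) for dist in (0, 1, 16, 256)}

for dist, value in by_distance.items():
    print(f"distance {dist:>3}: logit {value:.3f}")
```

In this toy setup the logit is largest at distance 0 and is smaller at every nonzero distance shown, even though the token content never changes. That is one simple way to see positional distance entering the attention score independently of content.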

