AI · CL · Dec 23, 2024

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

Tsinghua
arXiv:2412.17739v4 · 33 citations · h-index: 35 · ICML
Originality: Incremental advance
AI Analysis

This paper addresses length generalization in language models and is an incremental improvement for NLP applications.

The paper tackles context-length extension in language models. It identifies that the periodic attention enabled by Rotary Position Embedding (RoPE) is undermined by spectrum damage from linear layers and by insufficiently trained frequencies. It proposes Fourier Position Embedding (FoPE) to strengthen attention's frequency-domain properties, which yields more stable performance across varying context windows than the baselines.

Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While prior works mainly address RoPE's limitations within attention, this paper uncovers the adverse effects on length generalization from nearly all parts of LMs. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectrum damage caused by: 1) linear layers and activation functions; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier Series and zeroes out the destructive frequency components, increasing model robustness against the spectrum damage. Experiments across various model scales and benchmarks show that, within varying context windows, FoPE maintains a more stable performance compared to other baselines. Several analyses and ablations bring further support to our method and theoretical modeling.
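
The frequency-domain reading of RoPE in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code. The function names, the base of 10000, and the head dimension are assumptions chosen for illustration. The first part checks that a RoPE attention logit depends only on the relative offset and equals a sum of complex coefficients times exp(i * w_j * (m - n)) over non-uniformly spaced frequencies w_j. The second part shows one possible way to zero out components that never complete a full period inside the training window; the 2*pi / train_len cutoff is an assumption, and the Fourier Series construction that FoPE also uses is not covered here.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # One angular frequency per 2-D pair of dimensions: base^(-2j / head_dim).
    # The base value is the common RoPE default and is assumed here.
    j = np.arange(head_dim // 2)
    return base ** (-2.0 * j / head_dim)

def apply_rope(x, pos, freqs):
    # Treat each pair (x[2j], x[2j+1]) as a complex number and rotate it
    # by the angle pos * freqs[j].
    z = x[0::2] + 1j * x[1::2]
    return z * np.exp(1j * pos * freqs)

def rope_logit(q, k, m, n, freqs):
    # Real inner product between the query rotated to position m
    # and the key rotated to position n.
    qz = apply_rope(q, m, freqs)
    kz = apply_rope(k, n, freqs)
    return float(np.real(np.sum(qz * np.conj(kz))))

def fourier_logit(q, k, rel, freqs):
    # The same logit written as a Fourier-style sum over the relative
    # offset rel = m - n, with one complex coefficient per frequency.
    coeff = (q[0::2] + 1j * q[1::2]) * np.conj(k[0::2] + 1j * k[1::2])
    return float(np.real(np.sum(coeff * np.exp(1j * freqs * rel))))

def zero_out_undertrained(freqs, train_len):
    # Hypothetical cutoff (an assumption of this sketch): a component whose
    # period exceeds train_len tokens is treated as undertrained and set to
    # zero, so it contributes a position-independent (constant) term.
    floor = 2.0 * np.pi / train_len
    return np.where(freqs < floor, 0.0, freqs)
```

With these helpers, `rope_logit(q, k, 7, 3, freqs)` and `rope_logit(q, k, 107, 103, freqs)` agree up to floating-point error because both offsets equal 4, which is the periodic-extension property the abstract refers to.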

Code Implementations: 1 repo
