CL · Jan 15, 2025

Information Entropy Invariance: Enhancing Length Extrapolation in Attention Mechanisms

arXiv:2501.08570v3 · 3 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This paper addresses attention score dilution in long-range context handling for large language models, and represents an incremental improvement over prior methods.

The paper tackles the problem of improving length extrapolation in attention mechanisms by proposing two new scaled temperatures based on information entropy invariance, achieving state-of-the-art performance with a context window extended to 64 times the training length and outperforming seven existing methods.

Since the emergence of research on improving the length extrapolation capabilities of large language models in 2021, some studies have made modifications to the scaling factor in the scaled dot-product attention mechanism as part of their proposed methods without rigorous theoretical justifications. To fill this gap, we propose two new scaled temperatures based on information entropy invariance to enhance length extrapolation. First, a training-free method InfoScale is designed for dot-product attention, and preserves focus on original tokens during length extrapolation by ensuring consistent entropy. Second, we theoretically analyze the impact of scaling (CosScale) on cosine attention. Experimental data demonstrates that combining InfoScale and CosScale achieves state-of-the-art performance on the GAU-α model with a context window extended to 64 times the training length, and outperforms seven existing methods. Our analysis reveals that significantly increasing CosScale approximates the Windowed Attention, and highlights the significance of attention score dilution as a key challenge in long-range context handling. The code and data are available at https://github.com/HT-NEKO/Information-Entropy-Invariance.
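The entropy-invariance idea in the abstract can be illustrated numerically. The sketch below is not the paper's InfoScale formula, which the abstract does not state. It is a toy experiment under stated assumptions: queries and keys are random Gaussian vectors, the temperature is found by bisection rather than in closed form, and every function name is hypothetical. It shows that attention entropy rises as the context grows at a fixed temperature, and that a larger temperature can bring it back to the training-length value.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_entropy(scores, temperature=1.0):
    """Mean Shannon entropy (nats) of softmax(temperature * scores), per row."""
    p = softmax(temperature * scores, axis=-1)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())


def random_scores(n_queries, n_keys, d, rng):
    """Toy scaled dot-product scores q.k / sqrt(d) for Gaussian q and k."""
    q = rng.standard_normal((n_queries, d))
    k = rng.standard_normal((n_keys, d))
    return q @ k.T / np.sqrt(d)


def entropy_matching_temperature(scores, target_entropy,
                                 lo=1.0, hi=64.0, iters=50):
    """Bisection for the temperature whose mean entropy equals the target.

    Relies on entropy decreasing monotonically as the temperature grows,
    and assumes the target lies between the entropies at lo and hi.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if attention_entropy(scores, mid) > target_entropy:
            lo = mid  # attention still too diffuse: sharpen further
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_train = 64, 512  # assumed head dimension and training length
    h_train = attention_entropy(random_scores(32, n_train, d, rng))
    for n in (2048, 8192):
        s = random_scores(32, n, d, rng)
        t = entropy_matching_temperature(s, h_train)
        print(n, attention_entropy(s), t, attention_entropy(s, t))
```

In this toy setting the entropy at temperature 1 increases with the number of keys, which is one way to picture the attention score dilution the abstract mentions, and the solved temperature is greater than 1 for contexts longer than the training length. Real attention scores are not Gaussian, so the numbers are illustrative only.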

Code Implementations: 1 repo