ALiBi
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Long-context / context-window extension · first seen Aug 27, 2021
superseded — cited as a baseline and beaten by newer methods
4 papers critique it · 5 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites ALiBi as a baseline.
“Alibi press2022trainshorttestlong enhanced extrapolation capability through a distance-decaying linear attention bias, but its heuristic design lacks guarantees for monotonic decay, leading to suboptimal performance on extremely long sequences.”
— HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models
“However, the function rapidly approaches the zero point”
— Context-aware Biases for Length Extrapolation
“why it fails to retrieve information as it becomes local attention as the context length increases”
— Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
“However, the function rapidly approaches the zero point~chi2022kerple.”
— MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
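For readers who want to see what the "distance-decaying linear attention bias" and the "becomes local attention" critique refer to, here is a minimal Python sketch. It assumes the slope schedule from the ALiBi paper for power-of-two head counts (a geometric sequence starting at 2^(-8/n)) and, purely for illustration, sets every content score to zero so that only the bias shapes the attention weights. The function names are hypothetical and are not taken from any of the cited papers or libraries.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/n) with the same ratio.

    Assumption: num_heads is a power of two; other head counts are not handled here.
    """
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]


def alibi_bias_row(query_pos, slope):
    """Bias added to one query's attention logits over keys 0..query_pos (causal)."""
    return [-slope * (query_pos - key_pos) for key_pos in range(query_pos + 1)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mass_on_nearest(window, seq_len, slope):
    """Attention weight the last query gives its `window` nearest keys.

    Assumption: all content scores (q . k) are zero, so only the bias matters.
    """
    weights = softmax(alibi_bias_row(seq_len - 1, slope))
    return sum(weights[-window:])


if __name__ == "__main__":
    for slope in alibi_slopes(8):
        share = mass_on_nearest(16, 1024, slope)
        print(f"slope={slope:.6f}  weight on nearest 16 of 1024 keys: {share:.4f}")
```

Under these assumptions the steepest heads place almost all of their weight on the nearest few keys, while the shallowest heads keep a wider view. Real content scores would shift these shares, so the numbers describe the bias term only.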
Beaten on benchmarks
Head-to-head results where a newer method reports beating ALiBi. Values are copied from the source paper's tables — verify against the cited paper.
- DAPE: Data-Adaptive Positional Encoding for Length Extrapolation
DAPE-Kerple beats ALiBi · perplexity (mean) [training_length_512_eval_8192]
3.8642 vs 4.7679
- HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models
HoPE beats ALiBi · RGL [GovReport task]
19.34 vs 18.83
- Context-aware Biases for Length Extrapolation
CABLE beats ALiBi · Perplexity [GPT-2 Medium on FineWeb-Edu-10B, trained on T=1024]
17.00 vs 17.30
- Context-aware Biases for Length Extrapolation
CABLE beats ALiBi · Perplexity [GPT-2 Medium on WikiText-103, trained on T=1024]
23.70 vs 24.09
- Context-aware Biases for Length Extrapolation
CABLE beats ALiBi · nDCG@10 [BERT models trained on T=512]
21.36 vs 18.71
- Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation
BiPE-RoPE beats ALiBi · Average [ALiBi baseline]
22.36 vs 17.62
- MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
MEP beats ALiBi · Perplexity [OpenWebText2, parameter-free]
21.92 vs 22.14
- MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
MEP beats ALiBi · Perplexity [GitHub, parameter-free]
2.436 vs 2.450
- MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
MEP beats ALiBi · Perplexity [ArXiv, parameter-free]
5.612 vs 5.640
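The size of these margins varies a lot, which is easier to see as a relative change. The sketch below covers only the perplexity entries listed above, since lower perplexity is better and the direction of the other metrics is not stated here. The values are copied from the list; the helper name `relative_reduction` is a hypothetical choice. Because each row comes from a different paper, model, and dataset, the percentages may be read within a row but should not be compared across rows.

```python
# (newer method, setting, newer perplexity, ALiBi perplexity), copied from the list above
PERPLEXITY_PAIRS = [
    ("DAPE-Kerple", "training_length_512_eval_8192", 3.8642, 4.7679),
    ("CABLE", "FineWeb-Edu-10B", 17.00, 17.30),
    ("CABLE", "WikiText-103", 23.70, 24.09),
    ("MEP", "OpenWebText2", 21.92, 22.14),
    ("MEP", "GitHub", 2.436, 2.450),
    ("MEP", "ArXiv", 5.612, 5.640),
]


def relative_reduction(newer, baseline):
    """Fraction by which the newer method lowers perplexity relative to the baseline."""
    return (baseline - newer) / baseline


if __name__ == "__main__":
    for method, setting, newer, baseline in PERPLEXITY_PAIRS:
        pct = 100 * relative_reduction(newer, baseline)
        print(f"{method:12s} {setting:32s} {pct:5.2f}% lower than ALiBi")
```

The long-extrapolation setting (trained at 512, evaluated at 8192) shows by far the largest gap, while the remaining perplexity rows all differ from ALiBi by under two percent.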
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Mask Prior Suppression and Monotonic RoPE Scaling · Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models · May 14, 2026
- Apr 1, 2026
- C^2RoPE · C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning · Feb 11, 2026
- Imaginary Extension of Rotary Position Embeddings · Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs · Dec 8, 2025
- Nov 21, 2025