ALiBi (Long-context / context-window extension): superseded; cited as a baseline and beaten by newer methods. 4 papers critique it, 5 beat it on benchmarks, ranking it #7 of 53 most-superseded. Sub-problem: cluster led by RoPE. Newer alternatives in the same sub-problem include Mask Prior Suppression and Monotonic RoPE Scaling, CRoPE, C^2RoPE, Imaginary Extension of Rotary Position Embeddings, Selective RoPE.

Superseded baseline · #7 of 53 most-superseded

ALiBi

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Long-context / context-window extension · first seen Aug 27, 2021

superseded — cited as a baseline and beaten by newer methods

4 papers critique it · 5 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites ALiBi as a baseline.

“Alibi [press2022trainshorttestlong] enhanced extrapolation capability through a distance-decaying linear attention bias, but its heuristic design lacks guarantees for monotonic decay, leading to suboptimal performance on extremely long sequences.”
— HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models
“However, the function rapidly approaches the zero point”
— Context-aware Biases for Length Extrapolation
“why it fails to retrieve information as it becomes local attention as the context length increases”
— Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
“However, the function rapidly approaches the zero point [chi2022kerple].”
— MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
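
The critiques above concern the shape of ALiBi's bias: a per-head linear penalty on query-key distance that is added to the attention logits. The following is a minimal sketch of that bias in plain Python, not the authors' implementation. The function names (`alibi_slopes`, `alibi_bias_row`, `recent_mass`) are hypothetical, the slope schedule is limited to power-of-two head counts, and all content scores are assumed to be zero so that only the positional bias shapes the attention weights.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads)."""
    # Assumption: this sketch only covers power-of-two head counts.
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias_row(slope, query_pos):
    """Causal bias added to the logits of keys 0..query_pos for one query."""
    return [-slope * (query_pos - key_pos) for key_pos in range(query_pos + 1)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recent_mass(slope, query_pos, window):
    """Attention mass on the `window` most recent keys.

    Assumption: every content score is zero, so the weights come from
    the positional bias alone.
    """
    weights = softmax(alibi_bias_row(slope, query_pos))
    return sum(weights[-window:])


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    # Compare the steepest and the flattest head at two context lengths.
    for slope in (slopes[0], slopes[-1]):
        for length in (512, 4096):
            mass = recent_mass(slope, length - 1, window=32)
            print(f"slope={slope:.5f} length={length} mass_on_last_32={mass:.4f}")
```

Under these assumptions the steepest head places nearly all of its weight on the most recent keys at either length, because the exponentiated bias decays geometrically with distance. This is the behaviour the "rapidly approaches the zero point" and "becomes local attention" critiques point at. In a trained model, content scores can partly offset the bias, so the sketch illustrates the tendency rather than measuring it.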

Beaten on benchmarks

Head-to-head results where a newer method reports beating ALiBi. Values are copied from the source paper's tables — verify against the cited paper.

DAPE-Kerple beats ALiBi · perplexity (mean) [training_length_512_eval_8192]
3.8642 vs 4.7679
DAPE: Data-Adaptive Positional Encoding for Length Extrapolation
HoPE beats ALiBi · RGL [GovReport task]
19.34 vs 18.83
HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models
CABLE beats ALiBi · Perplexity [GPT-2 Medium on FineWeb-Edu-10B, trained on T=1024]
17.00 vs 17.30
Context-aware Biases for Length Extrapolation
CABLE beats ALiBi · Perplexity [GPT-2 Medium on WikiText-103, trained on T=1024]
23.70 vs 24.09
Context-aware Biases for Length Extrapolation
CABLE beats ALiBi · nDCG@10 [BERT models trained on T=512]
21.36 vs 18.71
Context-aware Biases for Length Extrapolation
BiPE-RoPE beats ALiBi · Average [ALiBi baseline]
22.36 vs 17.62
Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation
MEP beats ALiBi · Perplexity [OpenWebText2, parameter-free]
21.92 vs 22.14
MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
MEP beats ALiBi · Perplexity [GitHub, parameter-free]
2.436 vs 2.450
MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
MEP beats ALiBi · Perplexity [ArXiv, parameter-free]
5.612 vs 5.640
MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
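
The reported gaps above differ widely in size, which is easier to see as a relative margin over the ALiBi value. The sketch below computes that margin from the numbers listed in this section. The helper name `relative_margin` and the `lower_is_better` flags are assumptions made for illustration: perplexity is treated as lower-is-better, while RGL, nDCG@10 and the BiPE "Average" are treated as higher-is-better. The rows come from different papers and setups, so the margins are not comparable across rows.

```python
# (newer method, metric and setting, newer value, ALiBi value, lower_is_better)
# Values are the ones listed in this section; the direction flags are assumed.
REPORTED = [
    ("DAPE-Kerple", "perplexity (mean), train 512 / eval 8192", 3.8642, 4.7679, True),
    ("HoPE", "RGL, GovReport", 19.34, 18.83, False),
    ("CABLE", "perplexity, FineWeb-Edu-10B", 17.00, 17.30, True),
    ("CABLE", "perplexity, WikiText-103", 23.70, 24.09, True),
    ("CABLE", "nDCG@10, BERT T=512", 21.36, 18.71, False),
    ("BiPE-RoPE", "Average", 22.36, 17.62, False),
    ("MEP", "perplexity, OpenWebText2", 21.92, 22.14, True),
    ("MEP", "perplexity, GitHub", 2.436, 2.450, True),
    ("MEP", "perplexity, ArXiv", 5.612, 5.640, True),
]


def relative_margin(newer, alibi, lower_is_better):
    """Improvement of the newer method as a fraction of the ALiBi value."""
    gain = (alibi - newer) if lower_is_better else (newer - alibi)
    return gain / alibi


if __name__ == "__main__":
    for method, setting, newer, alibi, lower in REPORTED:
        margin = relative_margin(newer, alibi, lower)
        print(f"{method:12s} {setting:42s} {margin:7.2%}")
```

Read this way, several of the perplexity wins are margins of around one or two percent or less, while the DAPE-Kerple, nDCG@10 and BiPE-RoPE rows show much larger gaps. As the section notes, the values should be verified against the cited papers.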

Newer alternatives

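Recent methods in the same sub-problem, not yet superseded in the knowledge base. The entry summary names Mask Prior Suppression and Monotonic RoPE Scaling, CRoPE, C^2RoPE, Imaginary Extension of Rotary Position Embeddings, and Selective RoPE.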