Is KERPLE superseded?

KERPLE (Long-context / context-window extension) is superseded: newer papers cite it as a baseline and report beating it. 2 papers critique it and 3 beat it on benchmarks, which ranks it #12 of 53 most-superseded. It belongs to the sub-problem cluster led by RoPE. Newer alternatives in the same sub-problem include Mask Prior Suppression and Monotonic RoPE Scaling, CRoPE, C^2RoPE, Imaginary Extension of Rotary Position Embeddings, and Selective RoPE.

KERPLE

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Long-context / context-window extension · first seen May 20, 2022

What papers say

Verbatim critique sentences, each from a paper that cites KERPLE as a baseline.

“the learned static positional encoding (such as Kerple and FIRE) is an average optimal solution across all training samples. Consequently, while they might be generally effective, they are inherently suboptimal for any specific instance.”
— DAPE: Data-Adaptive Positional Encoding for Length Extrapolation
“However, this incorporation of additional trainable parameters results in diminished training velocities.”
— MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
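
The DAPE critique above turns on the bias being static: it depends only on the distance between tokens, never on the input itself. The following is a minimal sketch of that property, assuming the logarithmic kernel form -r1 * log(1 + r2 * |i - j|) usually attributed to KERPLE. The function names and the fixed values of r1 and r2 are illustrative assumptions (in the method they would be learned, positive, per-head scalars), and none of this is taken from the cited papers' code.

```python
import math


def kerple_log_bias(seq_len, r1=1.0, r2=1.0):
    """Static relative bias b[i][j] = -r1 * log(1 + r2 * |i - j|).

    Assumed form of the logarithmic KERPLE kernel; r1 and r2 are plain
    floats here, standing in for learned positive parameters.
    """
    return [
        [-r1 * math.log1p(r2 * abs(i - j)) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def biased_causal_softmax(scores, bias):
    """Add the bias to raw attention scores, then apply a causal softmax."""
    weights = []
    for i, row in enumerate(scores):
        # Causal masking: position i only attends to positions 0..i.
        logits = [row[j] + bias[i][j] for j in range(i + 1)]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# The bias takes no input tokens as an argument: every sequence of a
# given length gets the same matrix, which is the "static" property.
short_bias = kerple_log_bias(4)
long_bias = kerple_log_bias(8)

# With uniform content scores, the bias alone favors nearby tokens.
uniform_scores = [[0.0] * 4 for _ in range(4)]
attention = biased_causal_softmax(uniform_scores, short_bias)
```

Because the bias is a fixed function of distance, the matrix for a longer sequence simply extends the shorter one, which is what allows evaluation beyond the training length. It is also why the same bias applies to every input, the limitation the DAPE quote describes.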

Beaten on benchmarks

Head-to-head results where a newer method reports beating KERPLE. Lower perplexity is better. Values are copied from each source paper's tables; verify them against the cited paper.

DAPE-Kerple beats KERPLE · perplexity (mean) [training_length_512_eval_8192]
DAPE-Kerple 3.8642 vs KERPLE 13.3524
Source: DAPE: Data-Adaptive Positional Encoding for Length Extrapolation

DAPE-Kerple beats KERPLE · perplexity (mean) [training_length_512_eval_2048]
DAPE-Kerple 4.0505 vs KERPLE 5.4438
Source: DAPE: Data-Adaptive Positional Encoding for Length Extrapolation

CABLE beats KERPLE · Perplexity [GPT-2 Medium on FineWeb-Edu-10B, trained on T=1024]
CABLE 15.41 vs KERPLE 26.13
Source: Context-aware Biases for Length Extrapolation

MEP beats KERPLE · Perplexity [OpenWebText2, parametric]
MEP 21.23 vs KERPLE 21.27
Source: MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation

MEP beats KERPLE · Perplexity [GitHub, parametric]
MEP 2.239 vs KERPLE 2.242
Source: MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
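
The margins in the results listed here differ widely, so it can help to express each one as a relative perplexity reduction over KERPLE. The sketch below does that arithmetic using only the values listed in this section; the helper name `relative_reduction` and the tuple layout are illustrative assumptions, and the numbers should still be checked against the cited papers.

```python
# (newer method, setting, newer perplexity, KERPLE perplexity),
# copied from the head-to-head results listed in this section.
results = [
    ("DAPE-Kerple", "training_length_512_eval_8192", 3.8642, 13.3524),
    ("DAPE-Kerple", "training_length_512_eval_2048", 4.0505, 5.4438),
    ("CABLE", "GPT-2 Medium on FineWeb-Edu-10B, T=1024", 15.41, 26.13),
    ("MEP", "OpenWebText2, parametric", 21.23, 21.27),
    ("MEP", "GitHub, parametric", 2.239, 2.242),
]


def relative_reduction(new_ppl, baseline_ppl):
    """Fraction by which new_ppl is lower than baseline_ppl."""
    return (baseline_ppl - new_ppl) / baseline_ppl


for method, setting, new_ppl, kerple_ppl in results:
    pct = 100.0 * relative_reduction(new_ppl, kerple_ppl)
    print(f"{method} [{setting}]: {pct:.2f}% lower perplexity than KERPLE")
```

On these figures the DAPE-Kerple and CABLE gaps are tens of percent, while both MEP gaps are well under one percent, so the entries should not be read as equally strong evidence. The results also come from different papers, datasets, and training setups, so the percentages are not comparable across rows.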

Newer alternatives

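Recent methods in the same sub-problem, not yet superseded in the knowledge base, include Mask Prior Suppression and Monotonic RoPE Scaling, CRoPE, C^2RoPE, Imaginary Extension of Rotary Position Embeddings, and Selective RoPE.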