YaRN
YaRN: Efficient Context Window Extension of Large Language Models
Long-context / context-window extension · first seen Aug 31, 2023
heavily superseded — a standard baseline that newer methods routinely beat
10 papers critique it · 8 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites YaRN as a baseline.
“traditional approaches [chen2023extending] often suffer from a significant performance drop [chen2023clex, ding2024longrope] at the target length due to their limited generalization capability.”
— DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer Scaling Factor Search
“the efficiency is relatively low. For example, to reach the context length of 128K tokens, using YaRN, one has to pretrain an LLM on 64K tokens.”
— Stacked from One: Multi-Scale Self-Injection for Context Window Extension
“rescaling factors derived from previous methods often fall short of achieving the effective target context length.”
— LongRoPE2: Near-Lossless LLM Context Window Scaling
“PI and YaRN suffer from slow motion, leading to lower Dynamic Degree”
— RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
“However, these static approaches do not account for the distinctive spectral progression of the diffusion process, where low-frequency structures are generated in the first sampling steps, while high-frequency details are resolved later”
— DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
“these methods typically require finetuning to achieve extension, which can be resource and time-intensive given the quadratic complexity of Transformers”
— LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
“methods like NTK, Dyn-NTK, and YaRN suffer from attention logit outliers due to their positional embedding interpolations”
— A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)
“While NTK-By-Parts and YaRN have lower perplexity in language modeling tasks, PI has better fine-tuning performance on long-context downstream tasks that are more related to practical scenarios.”
— Extending LLMs' Context Window with 100 Samples
“although YaRN improves the length extrapolation capability of RoPE to some extent, it still suffers from performance drop when being evaluated on very long sequences”
— Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation
“these methods typically require additional fine-tuning on longer texts and have explicit extrapolation upper bounds.”
— LaMPE: Length-aware Multi-grained Positional Encoding for Adaptive Long-context Scaling Without Training
Beaten on benchmarks
Head-to-head results where a newer method reports beating YaRN. Values are copied from the source paper's tables — verify against the cited paper.
- LongRoPE2: Near-Lossless LLM Context Window Scaling
LongRoPE2 beats YaRN · RULER average at 128k [Base Model: Phi3-mini (3.8B)]
58.81 vs 39.37
- LongRoPE2: Near-Lossless LLM Context Window Scaling
LongRoPE2 beats YaRN · RULER average at 128k [Base Model: LLaMA3-8B]
82.03 vs 49.39
- LongRoPE2: Near-Lossless LLM Context Window Scaling
LongRoPE2 beats YaRN · Average [Base Model: Phi3-mini (3.8B) with 128k context window]
61.7 vs 53.6
- LongRoPE2: Near-Lossless LLM Context Window Scaling
LongRoPE2 beats YaRN · Average [Base Model: LLaMA3-8B with 128k context window]
55.7 vs 52.1
- LongRoPE2: Near-Lossless LLM Context Window Scaling
LongRoPE2 beats YaRN · LOFT Avg. [Base model: Phi3-mini (3.8B)]
23.00 vs 5.86
- LongRoPE2: Near-Lossless LLM Context Window Scaling
LongRoPE2 beats YaRN · InfiniteBench - LongBench Avg. [Base model: Phi3-mini (3.8B)]
55.23 vs 50.96
- LongRoPE2: Near-Lossless LLM Context Window Scaling
LongRoPE2 beats YaRN · LOFT Avg. [Base model: LLaMA3-8B]
74.28 vs 26.14
- LongRoPE2: Near-Lossless LLM Context Window Scaling
LongRoPE2 beats YaRN · InfiniteBench - LongBench Avg. [Base model: LLaMA3-8B]
73.37 vs 51.81
- Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation
The paper's proposed method (dimension-wise positional embeddings manipulation) beats YaRN · Avg. (13 tasks) [Mistral-v0.2 (7B), 32K / 128K]
67.13 vs 62.35
- RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
RIFLEx (ours) beats YaRN · Overall metrics [CogVideoX-5B with 2x extrapolation, training-free]
56.9 vs 44.6
- RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
RIFLEx (ours) beats YaRN · Overall metrics [HunyuanVideo with 2x extrapolation, training-free]
65.2 vs 58.2
- PSC: Extending Context Window of Large Language Models via Phase Shift Calibration
YaRN_PSC beats YaRN · perplexity at 65536 tokens, lower is better [YaRN 64k context]
2.05 vs 2.08
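The head-to-head pairs above mix metrics where higher is better (RULER, LOFT, overall video scores) with one where lower is better (perplexity), so the raw gaps are not directly comparable. The following is a minimal sketch of how the margins could be tabulated. The `ROWS` selection, the `higher_is_better` flags, and the helper names `margin` and `summarize` are illustrative assumptions, and the numbers are copied from the listing above rather than re-verified against the papers.

```python
# Each row: (paper, setting, newer_score, yarn_score, higher_is_better).
# Scores are copied from the listing above; the boolean flag is our assumption
# about the metric direction (perplexity is the only lower-is-better entry).
ROWS = [
    ("LongRoPE2", "RULER avg @128k, Phi3-mini (3.8B)", 58.81, 39.37, True),
    ("LongRoPE2", "RULER avg @128k, LLaMA3-8B", 82.03, 49.39, True),
    ("RIFLEx", "Overall, CogVideoX-5B, 2x", 56.9, 44.6, True),
    ("RIFLEx", "Overall, HunyuanVideo, 2x", 65.2, 58.2, True),
    ("PSC", "Perplexity @65536 tokens", 2.05, 2.08, False),
]


def margin(newer, yarn, higher_is_better):
    """Return (absolute gain, gain as % of the YaRN score) for the newer method."""
    delta = newer - yarn if higher_is_better else yarn - newer
    return delta, 100.0 * delta / yarn


def summarize(rows):
    """Return (paper, setting, rounded absolute gain, rounded relative gain) per row."""
    out = []
    for paper, setting, newer, yarn, higher_is_better in rows:
        delta, rel = margin(newer, yarn, higher_is_better)
        out.append((paper, setting, round(delta, 2), round(rel, 1)))
    return out


if __name__ == "__main__":
    for paper, setting, delta, rel in summarize(ROWS):
        print(f"{paper:10s} {setting:36s} +{delta:.2f} ({rel:.1f}%)")
```

Relative gains computed this way are only a reading aid: the rows come from different base models, context lengths, and tasks, so they should not be ranked against one another.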
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.