RoPE
RoFormer: Enhanced Transformer with Rotary Position Embedding
Long-context / context-window extension · first seen Apr 20, 2021
heavily superseded — a standard baseline that newer methods routinely beat
17 papers critique it · 8 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites RoPE as a baseline.
“RoPE-based language models have poor length generalization.”
— DAPE: Data-Adaptive Positional Encoding for Length Extrapolation
“However, RoPE can only operates on global angles, rendering relative angles implicit and inaccessible. Thus, RoPE struggles with periodic angular relations essential in trajectory prediction since it fails to address modular transformations”
— DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling
“Although effective, RoPE still relies on predefined static frequency patterns that are uniform across different inputs and attention heads. As a result, it remains position-dependent but not token- or context-dependent, limiting its expressiveness in modeling more nuanced sequence structures.”
— Context-aware Rotary Position Embedding
“the inherent flaw of rotary position embedding (RoPE) being used”
— Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective
“However, RoPE's 1D design, effective for text, overlooks the spatiotemporal structure of video data, limiting its suitability for Video-LLMs.”
— VRoPE: Rotary Position Embedding for Video Large Language Models
“However, RoPE exhibits oscillatory attention patterns due to its trigonometric periodicity, which can destabilize long-distance dependency modeling barbero2024roundroundgomakes.”
— HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models
“positional attention collapse, induced by the inherent locality bias of RoPE”
— Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
“However, barbero2024round later provided a mathematical analysis showing that this claim is flawed: attention weights under RoPE do not necessarily decay proportionally with relative query-key distances.”
— Context-aware Biases for Length Extrapolation
“it does not decouple content and position semantically”
— Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
“RoPE suffers from long-term decay, as shown in Figure~fig:correlation(c), implying that as the relative distance increases, the relative upper bound on token correlations at modeled relative positions will continuously decrease.”
— 3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding
“We hypothesize that, for long distance attention, the way that RoPE rotates the query and the key vectors may prevent the model from utilizing the dimensions that it rotates significantly.”
— The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval
“The essential components (i.e., the RoPE matrices) of previous RoPE approaches rely on 2D rotation groups, which simplify computations but consequently restrict their feature projection capabilities, especially in high-dimensional spaces”
— ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices
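Several of the critiques above target the same mechanics: fixed, input-independent frequencies and independent 2D rotations of feature pairs. The following is a minimal pure-Python sketch of that mechanism, assuming the pairing of consecutive dimensions and the base of 10000 from the RoFormer formulation. The function names `rope_rotate` and `attention_score` and the toy vectors are illustrative, not taken from any of the cited papers.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle of position * theta_i.

    theta_i = base ** (-2i / d) is fixed in advance: it does not depend on
    the token, the context, or the attention head.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE pairs up dimensions, so d must be even"
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)      # static frequency for pair i
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation of the pair
    return out

def attention_score(q, k, q_pos, k_pos):
    """Unnormalized dot-product score between a rotated query and key."""
    q_rot = rope_rotate(q, q_pos)
    k_rot = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))

# Toy vectors (hypothetical values, for illustration only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

near = attention_score(q, k, 5, 2)           # relative offset of 3
shifted = attention_score(q, k, 105, 102)    # same offset, shifted by 100
other = attention_score(q, k, 5, 3)          # different offset
```

In this sketch `near` and `shifted` agree up to floating-point error, because the score depends only on the relative offset between the two positions, while `other` differs. Content enters only through `q` and `k`; the rotation angles themselves never see it, which is the property the context-aware variants set out to change.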
Beaten on benchmarks
Head-to-head results where a newer method reports beating RoPE. Values are copied from the source paper's tables — verify against the cited paper.
- DAPE: Data-Adaptive Positional Encoding for Length Extrapolation
DAPE-Kerple beats RoPE · perplexity (mean) [training_length_512_eval_8192]
3.8642 vs 265.4545
- DAPE: Data-Adaptive Positional Encoding for Length Extrapolation
DAPE-Kerple beats RoPE · perplexity (mean) [training_length_512_eval_2048]
4.0505 vs 134.1615
- Context-aware Rotary Position Embedding
CARoPE beats RoPE · Perplexity [GPT-Small models]
21.23 vs 21.31
- Context-aware Rotary Position Embedding
CARoPE beats RoPE · Perplexity [GPT-Tiny models]
28.99 vs 29.33
- VRoPE: Rotary Position Embedding for Video Large Language Models
VRoPE beats RoPE · Avg. [Video-Vicuna-7B]
44.48 vs 43.35
- VRoPE: Rotary Position Embedding for Video Large Language Models
VRoPE beats RoPE · Avg. [Video-Qwen2-1.5B]
49.96 vs 48.90
- VRoPE: Rotary Position Embedding for Video Large Language Models
VRoPE beats RoPE · Avg. [Video-Qwen2-7B]
56.35 vs 54.92
- VRoPE: Rotary Position Embedding for Video Large Language Models
VRoPE beats RoPE · Accuracy [256-512]
98.28 vs 94.84
- VRoPE: Rotary Position Embedding for Video Large Language Models
VRoPE beats RoPE · Accuracy [512-768]
95.16 vs 87.03
- VRoPE: Rotary Position Embedding for Video Large Language Models
VRoPE beats RoPE · Accuracy [768-1024]
90.31 vs 73.28
- VRoPE: Rotary Position Embedding for Video Large Language Models
VRoPE beats RoPE · Accuracy [1024-1216]
87.03 vs 54.84
- HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models
HoPE beats RoPE · perplexity [2048 sequence length]
16.46 vs 25.80
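As a quick reading aid, the four VRoPE accuracy rows above can be turned into margins. The sketch below only re-uses the numbers listed on this page; the bracket labels are copied as-is (their unit is not stated here), and the variable names are illustrative.

```python
# (bracket label, VRoPE accuracy, RoPE accuracy), copied from the rows above.
rows = [
    ("256-512", 98.28, 94.84),
    ("512-768", 95.16, 87.03),
    ("768-1024", 90.31, 73.28),
    ("1024-1216", 87.03, 54.84),
]

# Margin in accuracy points by which VRoPE leads RoPE in each bracket.
margins = [(label, round(vrope - rope, 2)) for label, vrope, rope in rows]

for label, margin in margins:
    print(f"[{label}] margin: {margin:.2f} points")
```

The margin grows from 3.44 points in the first bracket to 32.19 in the last, so the reported gap widens as the bracket values increase. The figures should still be checked against the cited paper's tables.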
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Mask Prior Suppression and Monotonic RoPE Scaling · Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models · May 14, 2026
- Apr 1, 2026
- C^2RoPE · C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning · Feb 11, 2026
- Imaginary Extension of Rotary Position Embeddings · Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs · Dec 8, 2025
- Nov 21, 2025