Is Ring Attention superseded?

Ring Attention (Long-context / context-window extension) is classed as superseded: it is cited as a baseline and beaten by newer methods. One paper critiques it and one paper reports beating it on benchmarks, which ranks it #35 of 53 most-superseded methods. Sub-problem: cluster led by Ring Attention.

What papers say

Verbatim critique sentences, each from a paper that cites Ring Attention as a baseline.

“While this approach achieves linear complexity O(nk), it suffers from two critical limitations: (1) limited receptive field growth that scales linearly with depth”
— $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling
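The two quantities in the quoted sentence can be illustrated with a small sketch. It assumes the critiqued pattern is a fixed local window in which each token attends to `w` neighbours on each side (so k = 2w + 1 keys per query); that pattern, and the names `neighbours`, `attended_pairs`, and `receptive_field_sizes`, are illustrative assumptions and are not taken from the cited paper.

```python
def neighbours(i, n, w):
    """Positions token i attends to under an assumed local window of half-width w."""
    return range(max(0, i - w), min(n, i + w + 1))


def attended_pairs(n, w):
    """Query-key pairs evaluated in one layer; bounded by n * k with k = 2*w + 1."""
    return sum(len(neighbours(i, n, w)) for i in range(n))


def receptive_field_sizes(n, w, depth, token):
    """Count of input positions that can influence `token` after 1..depth layers."""
    reach = {token}
    sizes = []
    for _ in range(depth):
        # One more layer: everything reachable so far pulls in its own window.
        reach = {j for i in reach for j in neighbours(i, n, w)}
        sizes.append(len(reach))
    return sizes


n, w = 256, 8            # hypothetical sequence length and window half-width
k = 2 * w + 1
print("pairs per layer:", attended_pairs(n, w), "upper bound n*k:", n * k)
print("receptive field by depth:", receptive_field_sizes(n, w, depth=6, token=n // 2))
```

Under this assumption the per-layer cost stays within n·k, while the receptive field of an interior token widens by a constant 2w positions per layer, which is the linear growth with depth that the quote describes.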

Beaten on benchmarks

Head-to-head results where a newer method reports beating Ring Attention. Values are copied from the source paper's tables — verify against the cited paper.

PiAttention (Ours) beats Ring Attention on every result below. Values are PiAttention (Ours) vs Ring Attention; all rows are from $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling.

Acc [ListOps] · 67.9 vs 62.3
F1 [RetrievalQA] · 84.5 vs 78.9
Acc [Pathfinder] · 89.1 vs 85.2
R@1 [MSCOCO] · 72.4 vs 68.3
R@5 [MSCOCO] · 91.2 vs 88.9
R@10 [MSCOCO] · 96.8 vs 95.1
R@1 [Flickr30K] · 76.3 vs 72.1
R@5 [Flickr30K] · 94.1 vs 91.8
R@10 [Flickr30K] · 98.2 vs 96.9
Training Time [WikiText-103] · 12.4 vs 14.6 (lower is better)
Inference Time [WikiText-103] · 36.7 vs 44.3 (lower is better)
MFU [WikiText-103] · 55.4 vs 51.7
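To compare the size of the reported margins across metrics, the head-to-head values can be turned into absolute and relative gains. The sketch below copies the numbers from the listing above; the `higher_is_better` flags are an assumption (the two time metrics are treated as lower-is-better, everything else as higher-is-better), and the helper names are hypothetical.

```python
# (dataset, metric, PiAttention, Ring Attention, higher_is_better)
# Values copied from the listing above; the direction flags are an assumption.
RESULTS = [
    ("ListOps", "Acc", 67.9, 62.3, True),
    ("RetrievalQA", "F1", 84.5, 78.9, True),
    ("Pathfinder", "Acc", 89.1, 85.2, True),
    ("MSCOCO", "R@1", 72.4, 68.3, True),
    ("MSCOCO", "R@5", 91.2, 88.9, True),
    ("MSCOCO", "R@10", 96.8, 95.1, True),
    ("Flickr30K", "R@1", 76.3, 72.1, True),
    ("Flickr30K", "R@5", 94.1, 91.8, True),
    ("Flickr30K", "R@10", 98.2, 96.9, True),
    ("WikiText-103", "Training Time", 12.4, 14.6, False),
    ("WikiText-103", "Inference Time", 36.7, 44.3, False),
    ("WikiText-103", "MFU", 55.4, 51.7, True),
]


def margin(new, old, higher_is_better):
    """Return (absolute gain, relative gain in %) measured in the 'better' direction."""
    diff = new - old if higher_is_better else old - new
    return diff, 100.0 * diff / old


def summarize(results):
    """One (dataset, metric, abs_gain, rel_gain_pct) row per result, rounded to 1 decimal."""
    rows = []
    for dataset, metric, new, old, higher_is_better in results:
        abs_gain, rel_gain = margin(new, old, higher_is_better)
        rows.append((dataset, metric, round(abs_gain, 1), round(rel_gain, 1)))
    return rows


for dataset, metric, abs_gain, rel_gain in summarize(RESULTS):
    print(f"{metric} [{dataset}]: +{abs_gain} ({rel_gain}% relative to Ring Attention)")
```

Relative gains are computed against the Ring Attention value, so they are only comparable within a metric family; the underlying numbers still need to be checked against the cited paper's tables.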