LG · Apr 23

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

arXiv:2604.2121580.63 citations · h-index: 17
Predicted impact top 15% in LG · last 90 days · Originality: Highly original
AI Analysis

For practitioners of autoregressive sequence modeling, the Recurrent Transformer addresses the temporal shallowness of standard Transformers: its results suggest that recurrence can trade depth for width, improving cross-entropy at a fixed parameter count while reducing KV cache memory and inference latency.

The Recurrent Transformer introduces layerwise recurrent memory by having each layer attend to key-value pairs computed from its own activations, achieving greater effective depth without increasing decoding cost. On 150M and 300M parameter C4 pretraining, it improves cross-entropy over a parameter-matched Transformer baseline while using fewer layers, which suggests a smaller KV cache and lower inference latency.
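The layerwise recurrence summarized above can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation: the function name `attention_layer`, the `kv_from_output` switch, the residual form, the strictly-past attention window, and the zero context at the first position are all assumptions made for brevity.

```python
import math
import random


def matvec(W, x):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(xs, Wq, Wk, Wv, kv_from_output=True):
    """Process positions left to right with a per-layer KV cache.

    kv_from_output=True  -> keys/values are projections of this layer's own
                            outputs (the recurrent variant, as assumed here).
    kv_from_output=False -> keys/values are projections of the layer's
                            inputs, as in a conventional Transformer layer.
    """
    d = len(xs[0])
    keys, values, ys = [], [], []
    for x in xs:
        q = matvec(Wq, x)
        if keys:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
            weights = softmax(scores)
            ctx = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
        else:
            ctx = [0.0] * d  # assumption: no history at the first position
        y = [xi + ci for xi, ci in zip(x, ctx)]  # residual connection
        src = y if kv_from_output else x
        keys.append(matvec(Wk, src))    # exactly one new cache entry per token
        values.append(matvec(Wv, src))
        ys.append(y)
    return ys, keys, values


def rand_matrix(rng, d):
    """Hypothetical helper: d x d matrix of standard-normal entries."""
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]


rng = random.Random(0)
d, n = 4, 5
Wq, Wk, Wv = rand_matrix(rng, d), rand_matrix(rng, d), rand_matrix(rng, d)
xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

ys_rec, keys_rec, _ = attention_layer(xs, Wq, Wk, Wv, kv_from_output=True)
ys_std, keys_std, _ = attention_layer(xs, Wq, Wk, Wv, kv_from_output=False)
```

With `kv_from_output=True`, the output at position t depends on outputs of the same layer at earlier positions, so the chain of sequential steps grows with t even inside a single layer. The cache still gains one key-value pair per token in both variants, which is consistent with the claim of unchanged decoding cost. In this sketch the two variants agree on the first two positions, because the first output equals the first input, and can diverge from the third position onward.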

Transformers process tokens in parallel but are temporally shallow: at position $t$, each layer attends to key-value pairs computed from the previous layer's activations, yielding a depth capped by the number of layers. Recurrent models offer unbounded temporal depth but suffer from optimization instability and have historically underutilized modern accelerators. We introduce the Recurrent Transformer, a simple architectural change in which each layer attends to key-value pairs computed from its own activations, yielding layerwise recurrent memory while preserving standard autoregressive decoding cost. We show that the architecture can emulate both (i) a conventional Transformer and (ii) token-to-token recurrent updates under mild assumptions, while avoiding optimization instability. Naively, prefill/training appears bandwidth-bound with effective arithmetic intensity near $1$ because keys and values are revealed sequentially; we give an exact tiling-based algorithm that preserves the mathematical computation while reducing HBM traffic from $\Theta(N^2)$ to $\Theta(N \log N)$, increasing effective arithmetic intensity to $\Theta(N / \log N)$ for sequence length $N$. On 150M and 300M parameter C4 pretraining, Recurrent Transformers improve cross-entropy over a parameter-matched Transformer baseline and achieve the improvement with fewer layers at a fixed parameter count, suggesting that recurrence can trade depth for width, thus reducing KV cache memory footprint and inference latency.
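To get a feel for the asymptotic claim, the following back-of-the-envelope sketch compares the two HBM-traffic growth rates. It uses one plausible accounting of the naive schedule (each position re-reads every earlier key-value pair), sets all constants to 1, and takes the logarithm base to be 2. The function names are hypothetical, and the numbers indicate scaling only, not measured bandwidth.

```python
import math


def naive_kv_reads(n):
    """Cache entries read if position t re-reads all t earlier KV pairs."""
    return sum(range(n))  # 0 + 1 + ... + (n - 1) = n(n - 1)/2, i.e. Theta(n^2)


def tiled_traffic(n):
    """Theta(n log n) growth with unit constant and base-2 logarithm (assumed)."""
    return n * math.log2(n)


def intensity_gain(n):
    """Ratio n^2 / (n log n) = n / log n, the stated intensity scaling."""
    return (n * n) / tiled_traffic(n)


for n in (1024, 4096, 16384):
    print(n, naive_kv_reads(n), round(tiled_traffic(n)), round(intensity_gain(n), 1))
```

Because the ratio grows as $N / \log N$, the benefit of the tiled schedule under these assumptions increases with sequence length rather than saturating.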

