LG · May 15

Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders

arXiv:2605.16640 · 0.23 · Has Code

AI Analysis: 55

Provides theoretical evidence that hybrid recurrent-attention architectures offer expressivity advantages over pure recurrent or pure attention models for language modeling.

The paper proves that hybrid Gated DeltaNet-Attention decoders solve a parity-conditioned retrieval task with a constant-size scratchpad, whereas pure Gated DeltaNet models cannot solve it this way and pure Gated Attention requires at least a polynomial-size scratchpad.

We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define a parity-conditioned retrieval task and show that, under a constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant-size scratchpad, or equivalently $O(1)$ chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial-size scratchpad.
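
The abstract does not spell out the task, so the following is only a toy illustration of what "parity-conditioned retrieval" could look like, not the paper's formal definition. It assumes a hypothetical instance made of a bit string and two key-value tables, where the parity of the bits selects which table answers the query. The names `running_parity` and `parity_conditioned_retrieve` are made up for this sketch. Loosely, the one-bit running state mirrors the kind of state tracking a recurrent head can do, and the table lookup mirrors attention-style retrieval.

```python
from typing import Mapping, Sequence


def running_parity(bits: Sequence[int]) -> int:
    """Track parity with a single bit of state, updated once per input bit."""
    state = 0
    for b in bits:
        state ^= b & 1  # XOR the incoming bit into the state
    return state


def parity_conditioned_retrieve(
    bits: Sequence[int],
    even_table: Mapping[str, str],
    odd_table: Mapping[str, str],
    query: str,
) -> str:
    """Hypothetical task instance: the parity of `bits` picks the table to query."""
    table = odd_table if running_parity(bits) == 1 else even_table
    return table[query]


if __name__ == "__main__":
    even = {"k": "a"}
    odd = {"k": "b"}
    # Three ones in the bit string, so the parity is odd and the odd table answers.
    print(parity_conditioned_retrieve([1, 0, 1, 1], even, odd, "k"))
```

In this toy version the answer needs both ingredients at once: a state that summarizes the whole bit string and an exact lookup by key. That is the intuition behind combining the two head types, though the paper's actual construction and proofs may differ substantially.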
