CL | Jul 6, 2025

RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling

Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre

DeepMind

arXiv:2507.04416v3 | 4.91 citations | h-index: 75 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses efficiency issues in large-scale language models, which makes it relevant to AI practitioners, though it is an incremental advance that builds on existing RNN and attention methods.

The paper tackles the computational bottleneck of Transformers by proposing RAT, a chunk-based sequence model that bridges RNN efficiency and attention accuracy. With a chunk size of 16, the RAT block trains 7x faster at a sequence length of 100K and generates 9x faster at the 4K position, while maintaining performance similar to standard attention.

Transformers have become the cornerstone of modern large-scale language models, but their reliance on softmax attention poses a computational bottleneck at both training and inference. Recurrent models offer high efficiency, but compressing the full sequence into a fixed-size and holistic representation can suffer from memory degradation in long contexts and limit fine-grained retrieval. To address this, we propose RAT, an intermediate design that bridges the efficiency of RNNs and the capacity of attention. RAT partitions the input into chunks, applies recurrence within each chunk for local dependencies, and softmax-based attention across chunks for long-range interactions. This design mitigates memory degradation and enables direct access to distant tokens, while retaining computational efficiency. Empirically, with a chunk size of 16, the RAT block achieves a 7$\times$ improvement in training speed for 100K sequence length and 9$\times$ in generation at the 4K position, while maintaining similar performance compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage, but also consistently enhances performance and shows the overall best results. Code is available at https://github.com/CLAIRE-Labo/RAT.
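The chunk-based design described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the fixed scalar `gate`, the use of raw inputs as queries, and the names `chunk_recurrence` and `rat_like_block` are assumptions made for illustration, and the real model presumably uses learned gates and projections. The sketch only shows the general idea of recurrence inside each chunk and causal softmax attention over one state per chunk.

```python
import numpy as np

def chunk_recurrence(x, chunk_size, gate=0.5):
    """Gated recurrence that resets at every chunk boundary (assumed form)."""
    x = np.asarray(x, dtype=float)
    T, d = x.shape
    h = np.zeros_like(x)
    for t in range(T):
        # Start from a zero state at the first token of each chunk.
        prev = np.zeros(d) if t % chunk_size == 0 else h[t - 1]
        h[t] = gate * prev + (1.0 - gate) * x[t]
    return h

def rat_like_block(x, chunk_size=16, gate=0.5):
    """Recurrence within chunks, softmax attention across chunk states."""
    x = np.asarray(x, dtype=float)
    T, d = x.shape
    assert T % chunk_size == 0, "sketch assumes T is a multiple of chunk_size"
    h = chunk_recurrence(x, chunk_size, gate)
    # The last state of each chunk serves as that chunk's summary.
    summaries = h[chunk_size - 1 :: chunk_size]  # shape (T // chunk_size, d)
    out = np.empty_like(x)
    for t in range(T):
        c = t // chunk_size
        # Causal key/value set: summaries of completed chunks plus the
        # running state of the current chunk at position t.
        kv = np.vstack([summaries[:c], h[t][None, :]])  # shape (c + 1, d)
        scores = kv @ x[t] / np.sqrt(d)  # x[t] stands in for a query (assumption)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[t] = weights @ kv
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(64, 8))
    print(rat_like_block(tokens, chunk_size=16).shape)
```

In this sketch, a token at position t attends over at most T / chunk_size entries instead of up to T, so at a sequence length of 100K with a chunk size of 16 the attention set holds roughly 6,250 entries per token. That reduction is one plausible source of the speedups the abstract reports, although the actual figures depend on the authors' kernels and hardware.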
