LG CL | Jun 5, 2025

Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets

arXiv:2506.05229v1 | 4.1 | h-index: 12 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses inference latency and cost for real-world, long-context applications in AI, representing an incremental improvement to existing RMT methods.

The paper targets the bottleneck that sequential execution creates in Recurrent Memory Transformers (RMTs) during long-context inference. It introduces Diagonal Batching to run segments in parallel, reporting a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over sequential RMT on 131,072-token sequences.

Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.
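The reordering idea can be illustrated with a small scheduling sketch. This is not the authors' implementation. It assumes each layer keeps its own memory state, so that the cell for (segment s, layer l) depends only on (s, l-1) and (s-1, l). Under that assumption, all cells with the same s + l are independent and can run together. The names `diagonal_schedule`, `toy_cell`, `run_sequential`, and `run_diagonal` are hypothetical, and `toy_cell` is an arithmetic stand-in for a transformer layer.

```python
from typing import List, Tuple


def diagonal_schedule(num_segments: int, num_layers: int) -> List[List[Tuple[int, int]]]:
    """Group (segment, layer) cells by anti-diagonal index segment + layer."""
    steps = []
    for d in range(num_segments + num_layers - 1):
        lo = max(0, d - num_layers + 1)
        hi = min(d, num_segments - 1)
        steps.append([(s, d - s) for s in range(lo, hi + 1)])
    return steps


def toy_cell(hidden: int, memory: int, layer: int) -> Tuple[int, int]:
    # Hypothetical stand-in for one layer: returns (new hidden, new memory).
    new_hidden = (hidden * 31 + memory + layer) % 1_000_003
    new_memory = (memory * 17 + new_hidden) % 1_000_003
    return new_hidden, new_memory


def run_sequential(inputs: List[int], num_layers: int) -> List[int]:
    # Baseline order: finish every layer of a segment before the next segment.
    memory = [0] * num_layers
    outputs = []
    for x in inputs:
        h = x
        for l in range(num_layers):
            h, memory[l] = toy_cell(h, memory[l], l)
        outputs.append(h)
    return outputs


def run_diagonal(inputs: List[int], num_layers: int) -> List[int]:
    # Diagonal order: each group touches distinct segments and distinct layers,
    # so its cells could be evaluated as one batch on a GPU.
    memory = [0] * num_layers
    hidden = list(inputs)  # hidden[s] is segment s at its current depth
    for group in diagonal_schedule(len(inputs), num_layers):
        results = [(s, l, toy_cell(hidden[s], memory[l], l)) for s, l in group]
        for s, l, (h, m) in results:
            hidden[s] = h
            memory[l] = m
    return hidden


if __name__ == "__main__":
    segments = [1, 2, 3, 4, 5]
    print(run_sequential(segments, 3) == run_diagonal(segments, 3))
    print(len(diagonal_schedule(5, 3)), "grouped steps instead of", 5 * 3)
```

In this toy setting the two orders produce identical outputs, which mirrors the "exact recurrence" claim. The number of dependent steps drops from segments × layers to segments + layers - 1.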
