Recurrent Memory-Augmented Transformers with Chunked Attention for Long-Context Language Modeling
This work addresses the challenge of scaling language models to long contexts for tasks such as dialogue modeling and document understanding. It is an incremental improvement over existing methods.
The authors propose a Transformer architecture for long-context language modeling that integrates chunked local attention with a gated FIFO memory mechanism, so that the model efficiently handles both short-range and long-range dependencies without attention cost growing quadratically with context length.
We present a Transformer architecture for long-context language modeling that combines global attention with two biologically inspired components: chunked local attention and a gated FIFO memory mechanism. This unified attention block lets the model handle both short-range and long-range dependencies efficiently, without attention cost growing quadratically with sequence length. The memory module persistently stores past token representations and updates them through a gating mechanism inspired by recurrent networks. Rotary positional encoding is applied per attention head to provide directionally disentangled, scale-invariant positional signals. The architecture is implemented entirely from scratch in PyTorch, without relying on higher-level libraries, which enables transparent and modular experimentation. Our model offers a lightweight and extensible design for tasks such as dialogue modeling, code completion, and document understanding.
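To make the two components concrete, here is a minimal sketch of a chunked causal attention mask and a gated FIFO memory. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the gate, so the scalar dot-product gate, the names `chunk_mask` and `GatedFifoMemory`, and the `gate_bias` parameter are all hypothetical. Plain Python lists stand in for PyTorch tensors to keep the example dependency-free.

```python
import math
from collections import deque


def chunk_mask(seq_len, chunk_size):
    """Boolean mask: query i may attend key j only if j is in the
    same chunk as i and j is not in the future (j <= i)."""
    return [
        [j <= i and i // chunk_size == j // chunk_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GatedFifoMemory:
    """Fixed-size FIFO of vectors with a gated write (assumed design)."""

    def __init__(self, num_slots, gate_bias=0.0):
        # deque(maxlen=...) silently evicts the oldest slot when full.
        self.slots = deque(maxlen=num_slots)
        # Stand-in for a learned gate parameter (hypothetical).
        self.gate_bias = gate_bias

    def write(self, vec):
        """Blend vec with the newest slot, push the result, return the gate."""
        if not self.slots:
            # Empty memory: store the vector unchanged.
            self.slots.append(list(vec))
            return 1.0
        prev = self.slots[-1]
        # Assumed gate: sigmoid of the dot product between new and previous slot.
        gate = sigmoid(sum(a * b for a, b in zip(vec, prev)) + self.gate_bias)
        blended = [gate * v + (1.0 - gate) * p for v, p in zip(vec, prev)]
        self.slots.append(blended)
        return gate
```

With `chunk_size = c`, each query attends to at most `c` keys inside its chunk, so the local attention cost grows linearly with sequence length for a fixed `c`. A fixed number of memory slots adds a constant number of extra keys per query, which is one way the long-range path could stay cheap.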