LG, CL · May 9, 2025

Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

Andrew Kiruluta, Preethi Raju, Priscilla Burity

arXiv:2506.01963v1 · 1 citation · h-index: 3

Originality: Highly original

AI Analysis

This work addresses a critical bottleneck in scaling LLMs to ultra-long contexts and could enable applications in long-document processing and multi-turn conversation. It appears incremental, however, because it builds on existing techniques such as S4.

The authors tackled the problem of quadratic memory and computation overhead in Transformer-based LLMs for long contexts by developing a non-attention architecture that combines state space blocks, multi-resolution convolutions, a recurrent supervisor, and retrieval-augmented external memory, achieving efficient handling of hundreds of thousands to millions of tokens.

We present a novel non-attention-based architecture for large language models (LLMs) that efficiently handles very long context windows, on the order of hundreds of thousands to potentially millions of tokens. Unlike traditional Transformer designs, which incur quadratic memory and computation overhead because of the self-attention mechanism, our model avoids token-to-token attention entirely. Instead, it combines four complementary components: State Space blocks (inspired by S4) that learn continuous-time convolution kernels and scale near-linearly with sequence length; Multi-Resolution Convolution layers that capture local context at different dilation levels; a lightweight Recurrent Supervisor that maintains a global hidden state across sequential chunks; and Retrieval-Augmented External Memory that stores and retrieves high-level chunk embeddings without reintroducing quadratic operations.
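The data flow described in the abstract can be illustrated with a minimal, scalar-valued sketch in plain Python. It is not the authors' implementation: the function names, the fixed coefficients, the four-number chunk summary, and the tanh update used for the supervisor are all assumptions chosen for illustration, and a first-order linear recurrence stands in for the learned S4-style kernels. The point is only to show that each token is touched a constant number of times and that retrieval compares chunk summaries, never individual token pairs.

```python
import math

def ssm_scan(x, a=0.9, b=0.1, state=0.0):
    """Linear recurrence h_t = a*h_{t-1} + b*x_t (assumed stand-in for an S4 block)."""
    out = []
    for v in x:
        state = a * state + b * v
        out.append(state)
    return out, state

def dilated_conv(x, kernel, dilation):
    """Causal 1-D convolution with the given dilation and implicit left zero padding."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

def multi_res_conv(x, kernel=(0.5, 0.3, 0.2), dilations=(1, 2, 4)):
    """Average of causal convolutions taken at several dilation levels."""
    branches = [dilated_conv(x, kernel, d) for d in dilations]
    return [sum(vals) / len(branches) for vals in zip(*branches)]

def chunk_embedding(y):
    """Hypothetical chunk summary: (mean, max, min, last value)."""
    return (sum(y) / len(y), max(y), min(y), y[-1])

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    nu = math.sqrt(sum(p * p for p in u))
    nv = math.sqrt(sum(q * q for q in v))
    return dot / (nu * nv) if nu and nv else 0.0

class ChunkMemory:
    """External memory holding one embedding per chunk, not per token."""
    def __init__(self):
        self.keys = []

    def add(self, emb):
        self.keys.append(emb)

    def retrieve(self, query, k=2):
        ranked = sorted(self.keys, key=lambda e: cosine(query, e), reverse=True)
        return ranked[:k]

def process(tokens, chunk_size=8):
    """Process a sequence chunk by chunk; no token-to-token attention is used."""
    memory = ChunkMemory()
    ssm_state = 0.0    # carried across chunks by the recurrence
    supervisor = 0.0   # global hidden state (assumed scalar tanh update)
    outputs = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        local = multi_res_conv(chunk)
        mixed, ssm_state = ssm_scan(local, state=ssm_state)
        emb = chunk_embedding(mixed)
        retrieved = memory.retrieve(emb, k=2)
        context = sum(r[0] for r in retrieved) / len(retrieved) if retrieved else 0.0
        supervisor = math.tanh(0.5 * supervisor + 0.5 * emb[0] + 0.1 * context)
        outputs.extend(v + supervisor for v in mixed)
        memory.add(emb)
    return outputs, memory

if __name__ == "__main__":
    toy = [float(i % 5) for i in range(20)]
    outs, mem = process(toy, chunk_size=8)
    print(len(outs), len(mem.keys))
```

In this toy version the per-chunk work is linear in the chunk length, and the memory grows by one entry per chunk, so retrieval cost depends on the number of chunks rather than on the square of the token count.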
