Learning to (Learn at Test Time): RNNs with Expressive Hidden States
This work addresses efficient long-context sequence modeling. The approach is novel, but the reported gains over existing methods are incremental.
The authors address the limited expressive power of RNN hidden states in long-context sequence modeling by introducing Test-Time Training (TTT) layers, in which the hidden state is itself a machine learning model and the update rule is a step of self-supervised learning. In evaluations at 125M to 1.3B parameters, TTT-Linear and TTT-MLP keep reducing perplexity as they condition on more tokens, whereas Mamba stops improving after 16k context.
Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden states. We present a practical framework for instantiating sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Similar to Transformer, TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
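The key idea above, a hidden state that is itself a model and an update rule that is one step of self-supervised learning, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the loss, learning rate, or initialization, so the sketch assumes a plain reconstruction loss 0.5 * ||W x - x||^2, a fixed learning rate, a zero-initialized weight matrix, and one gradient step per token. The names ttt_linear_forward, matvec, and reconstruction_loss are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Apply the linear hidden-state model: f(x; W) = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def reconstruction_loss(x: Vector, z: Vector) -> float:
    """Assumed self-supervised loss: 0.5 * ||z - x||^2."""
    return 0.5 * sum((zi - xi) ** 2 for zi, xi in zip(z, x))


def ttt_linear_forward(tokens: List[Vector], lr: float = 0.1) -> List[Vector]:
    """Run a toy TTT-style layer over a sequence.

    Assumptions (not from the source): W starts at zero, the loss is plain
    reconstruction of the input token, and lr is a fixed constant.
    """
    dim = len(tokens[0])
    # The hidden state is the weight matrix of a linear model.
    # Its size is fixed, so the cost per token does not grow with context.
    W: Matrix = [[0.0] * dim for _ in range(dim)]
    outputs: List[Vector] = []
    for x in tokens:
        # Update rule: one gradient step on the assumed loss.
        # The gradient with respect to W is the outer product (W x - x) x^T.
        residual = [p - t for p, t in zip(matvec(W, x), x)]
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * residual[i] * x[j]
        # Output rule: the prediction of the updated model on this token.
        outputs.append(matvec(W, x))
    return outputs


if __name__ == "__main__":
    sequence = [[1.0, 0.0]] * 5
    for z in ttt_linear_forward(sequence):
        print(z, reconstruction_loss(sequence[0], z))
```

On a sequence that repeats the same token, the printed loss shrinks at every step, because each token trains the hidden state a little further even though the sequence is only seen at test time. Swapping the linear model for a two-layer MLP would correspond, under the same assumptions, to the TTT-MLP variant described in the abstract.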