LG · Dec 29, 2025

End-to-End Test-Time Training for Long Context

arXiv:2512.23675v2 · 24 citations · h-index: 21
Originality: Incremental advance
AI Analysis

This paper addresses efficient long-context processing for language models. Its test-time training approach is rated an incremental advance: it combines existing ideas rather than introducing a new architecture.

The paper frames long-context language modeling as a continual learning problem. It uses a standard Transformer with sliding-window attention that keeps learning at test time via next-token prediction on the given context. The result is constant inference latency regardless of context length, which makes the method 2.7 times faster than full attention at 128K context. For 3B models, it also scales with context length in the same way as full attention.
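A minimal counting sketch shows why sliding-window attention keeps per-token work flat as the context grows. The window size of 8,192 and the helper name `keys_attended` are assumptions chosen for illustration, not values from the paper. The count also ignores the cost of the test-time gradient steps, so it does not reproduce the reported 2.7x figure.

```python
from typing import Optional


def keys_attended(position: int, window: Optional[int] = None) -> int:
    """Keys a causal attention layer reads when decoding the token at `position` (0-indexed).

    With window=None the layer uses full attention; otherwise it sees at most
    the `window` most recent tokens, including the current one.
    """
    visible = position + 1  # causal mask: the current token and everything before it
    if window is None:
        return visible
    return min(visible, window)


WINDOW = 8192  # hypothetical window size, not taken from the paper

for context_length in (8_192, 32_768, 131_072):
    last = context_length - 1
    full = keys_attended(last)
    sliding = keys_attended(last, WINDOW)
    print(f"context={context_length:>7}  full={full:>7}  sliding_window={sliding:>5}")
```

Under these assumptions the full-attention count grows linearly with context length, while the sliding-window count stops growing once the context exceeds the window.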

We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
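The idea of compressing context into weights via next-token prediction can be sketched with a toy model. The example below is not the authors' implementation: it swaps the Transformer for a tiny bigram table, starts from a plain zero initialization instead of a meta-learned one, and uses hypothetical names (`ttt_step`, `avg_loss`, `CHUNK`) and a hypothetical learning rate. It only shows the loop structure: read a chunk of context, take a gradient step on the next-token loss, and move on.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def avg_loss(W, tokens):
    """Mean next-token cross-entropy of the bigram logits W on `tokens`."""
    pairs = list(zip(tokens, tokens[1:]))
    return -sum(math.log(softmax(W[prev])[nxt]) for prev, nxt in pairs) / len(pairs)


def ttt_step(W, chunk, lr=1.0):
    """One in-place gradient step on the next-token loss over `chunk`."""
    vocab = len(W)
    pairs = list(zip(chunk, chunk[1:]))
    grad = [[0.0] * vocab for _ in range(vocab)]
    for prev, nxt in pairs:
        probs = softmax(W[prev])
        for j in range(vocab):
            # d(cross-entropy)/d(logit_j) = p_j - 1[j == target]
            grad[prev][j] += (probs[j] - (1.0 if j == nxt else 0.0)) / len(pairs)
    for i in range(vocab):
        for j in range(vocab):
            W[i][j] -= lr * grad[i][j]


VOCAB = 4
CHUNK = 8  # hypothetical number of next-token targets per test-time update
context = [0, 1, 2, 3] * 16  # a repeating pattern standing in for a long context

# Zero init for simplicity; the paper instead meta-learns the initialization.
W = [[0.0] * VOCAB for _ in range(VOCAB)]

loss_before = avg_loss(W, context)
for start in range(0, len(context) - 1, CHUNK):
    ttt_step(W, context[start:start + CHUNK + 1])  # +1 so each chunk includes its last target
loss_after = avg_loss(W, context)

print(f"loss before test-time training: {loss_before:.3f}")
print(f"loss after test-time training:  {loss_after:.3f}")
```

After the loop, the weights carry information about the context's pattern, so the model no longer needs the earlier tokens in view to predict it. The memory footprint is the fixed-size weight table, independent of how long the context was.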

