LG Nov 8, 2025

Next-Latent Prediction Transformers Learn Compact World Models

Jayden Teoh, Manan Tomar, Kwangjun Ahn, Edward S. Hu, Pratyusha Sharma, Riashat Islam, Alex Lamb, John Langford

arXiv:2511.05963v121.311 citations h-index: 11

Originality: Incremental advance

AI Analysis

This addresses a foundational issue in sequence modeling for AI by improving transformer generalization through compact world models, though it is an incremental enhancement to existing training methods.

The paper tackles the problem of transformers lacking an inherent incentive to compress history into compact latent states, which leads to poor generalization, by introducing Next-Latent Prediction (NextLat), an auxiliary objective that trains transformers to learn predictive latent representations. The result is significant gains in downstream accuracy, representation compression, and lookahead planning across benchmarks for world modeling, reasoning, planning, and language modeling.

Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc lookups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next output token. Theoretically, we show that these latents provably converge to belief states, compressed information of the history necessary to predict the future. This simple auxiliary objective also injects a recurrent inductive bias into transformers, while leaving their architecture, parallel training, and inference unchanged. NextLat effectively encourages the transformer to form compact internal world models with its own belief states and transition dynamics -- a crucial property absent in standard next-token prediction transformers. Empirically, across benchmarks targeting core sequence modeling competencies -- world modeling, reasoning, planning, and language modeling -- NextLat demonstrates significant gains over standard next-token training in downstream accuracy, representation compression, and lookahead planning. NextLat stands as a simple and efficient paradigm for shaping transformer representations toward stronger generalization.
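The abstract describes the objective only in words, so the following is a minimal sketch of what "next-token loss plus a next-latent prediction term" could look like, not the authors' implementation. The squared-error form of the latent term, the `weight` parameter, and the names `next_token_loss`, `latent_prediction_loss`, `combined_loss`, and `predict_next_latent` are all assumptions made for illustration. Whether gradients flow into the target latent is also a design choice the abstract does not specify.

```python
import math


def next_token_loss(logits, targets):
    """Mean cross-entropy between per-position logits and target token ids."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def latent_prediction_loss(latents, next_tokens, predict_next_latent):
    """Mean squared error between predicted and actual next latents.

    latents[t] is the latent state after position t, and next_tokens[t] is
    the token observed at position t + 1. predict_next_latent is a
    hypothetical stand-in for a learned latent dynamics model.
    """
    total = 0.0
    steps = len(latents) - 1
    for t in range(steps):
        predicted = predict_next_latent(latents[t], next_tokens[t])
        target = latents[t + 1]  # treated as a fixed target in this sketch
        total += sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)
    return total / steps


def combined_loss(logits, targets, latents, next_tokens,
                  predict_next_latent, weight=1.0):
    """Next-token loss plus a weighted latent term (weight is an assumption)."""
    latent_term = latent_prediction_loss(latents, next_tokens, predict_next_latent)
    return next_token_loss(logits, targets) + weight * latent_term


# Toy check: the latent is a running sum of tokens, so a predictor that adds
# the next token to the current latent matches the transitions exactly.
toy_latents = [[0.0], [1.0], [3.0]]
toy_next_tokens = [1, 2]
toy_loss = latent_prediction_loss(
    toy_latents, toy_next_tokens, lambda h, x: [h[0] + x]
)
```

In the toy check the latent term vanishes because the latents obey a consistent transition rule given the next token, which is the property the auxiliary objective is described as encouraging.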
