LG · AI · Oct 16, 2025

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Divyat Mahajan, Sachin Goyal, Badr Youbi Idrissi, Mohammad Pezeshki, Ioannis Mitliagkas, David Lopez-Paz, Kartik Ahuja

Meta AI

arXiv:2510.14751v1 · 9 citations · h-index: 34

Originality: Incremental advance

AI Analysis

This work addresses long-horizon reasoning and planning in LLMs for applications like creative writing and coding. The advance is incremental, since it builds on existing prediction methods.

The paper tackles the limitations of next-token and multi-token prediction in LLMs on long-horizon tasks. It proposes future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of long-term future content. In pretraining experiments with 3B and 8B-parameter models, FSP improves over both NTP and MTP across math, reasoning, and coding benchmarks.

Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
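To make the handcrafted variant concrete, here is a minimal sketch of a bag-of-words future summary and an auxiliary loss against it. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed `horizon` window, the multi-hot encoding, and the choice of binary cross-entropy are all hypothetical, and the abstract does not specify any of them.

```python
import math

def future_bag_of_words(tokens, vocab_size, horizon):
    """For each position t, build a multi-hot vector marking which
    vocabulary ids occur in tokens[t+1 : t+1+horizon].
    (Assumed encoding; the window size `horizon` is a hypothetical knob.)"""
    targets = []
    for t in range(len(tokens)):
        row = [0.0] * vocab_size
        for tok in tokens[t + 1 : t + 1 + horizon]:
            row[tok] = 1.0
        targets.append(row)
    return targets

def bce_with_logits(logits, target):
    """Mean binary cross-entropy between auxiliary-head logits and a
    multi-hot target, in the numerically stable form
    max(z, 0) - z*y + log(1 + exp(-|z|)).
    (Assumed loss; the paper may use a different objective.)"""
    total = 0.0
    for z, y in zip(logits, target):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Toy sequence over a 4-token vocabulary.
tokens = [2, 0, 3, 0, 1]
targets = future_bag_of_words(tokens, vocab_size=4, horizon=3)

# An untrained auxiliary head emitting all-zero logits at position 0.
aux_loss = bce_with_logits([0.0] * 4, targets[0])
```

In a training loop, a loss of this kind would be added to the usual next-token loss with some weighting, so the auxiliary head pushes the shared representation to carry information about content beyond the next few tokens. The learned-summary variant would replace the multi-hot target with an embedding from a reverse language model, which this sketch does not cover.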
