Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries
The paper addresses long-horizon reasoning and planning in LLMs, which matters for applications such as creative writing and coding. The contribution is incremental in that it builds on existing next-token and multi-token prediction methods.
The paper tackles the limitations of next-token and multi-token prediction on long-horizon tasks by proposing future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of long-term future content. In pretraining experiments with 3B and 8B-parameter models, FSP improves over both NTP and MTP across math, reasoning, and coding benchmarks.
Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
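To make the handcrafted variant concrete, the following is a minimal sketch of how a bag-of-words future summary target could be built and scored. It is an illustration under stated assumptions, not the paper's implementation: the function names `future_bow_targets` and `bce_with_logits`, the fixed `horizon` window, the multi-hot encoding, and the binary cross-entropy loss are all hypothetical choices made here for clarity.

```python
import math


def future_bow_targets(tokens, vocab_size, horizon):
    """Build one multi-hot bag-of-words target per position.

    For position t, the target marks which token ids occur in the window
    tokens[t+1 : t+1+horizon]. The window definition is an assumption made
    for this sketch; the abstract only says "a bag of words summary of the
    future of the sequence".
    """
    targets = []
    for t in range(len(tokens)):
        row = [0.0] * vocab_size
        for tok in tokens[t + 1 : t + 1 + horizon]:
            row[tok] = 1.0  # presence only, counts are discarded
        targets.append(row)
    return targets


def bce_with_logits(logits, target):
    """Mean binary cross-entropy between per-vocabulary logits and a
    multi-hot target, using the numerically stable form
    max(z, 0) - z*y + log(1 + exp(-|z|)).

    Using BCE for the auxiliary head is an assumption of this sketch.
    """
    total = 0.0
    for z, y in zip(logits, target):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Toy sequence over a 4-token vocabulary, looking 2 tokens ahead.
tokens = [2, 0, 3, 0, 1]
targets = future_bow_targets(tokens, vocab_size=4, horizon=2)

# Stand-in for the auxiliary head's output at position 0 (all-zero logits).
aux_logits = [0.0, 0.0, 0.0, 0.0]
aux_loss = bce_with_logits(aux_logits, targets[0])
```

In a training loop, a loss like `aux_loss` would be added to the usual next-token loss, typically with a weighting coefficient; that coefficient and how the two heads share the backbone are not specified in the abstract and would need to be taken from the paper itself.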