CL | AI | CV | LG | May 15, 2025

Multi-Token Prediction Needs Registers

arXiv:2505.10518v1 | 10 citations | h-index: 21 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses a bottleneck in language model training that matters to both researchers and practitioners, though the advance is incremental because it builds on existing multi-token prediction methods.

The paper tackles the inconsistent benefits of multi-token prediction in fine-tuning by proposing MuToR, a method that uses learnable register tokens to predict future targets. The authors report that it is effective in supervised fine-tuning, PEFT, and pretraining across language and vision tasks.

Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
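The interleaving idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `interleave_registers`, the `register_id` and `horizon` parameters, the choice to place one register after every token, and the `-100` ignore label are all assumptions made for illustration. The abstract only states that register tokens are interleaved into the input and predict future targets, so details such as attention masking and position handling are left out here.

```python
from typing import List, Tuple

IGNORE = -100  # assumed "skip this position in the loss" label (a common convention)


def interleave_registers(
    tokens: List[int], register_id: int, horizon: int = 2
) -> Tuple[List[int], List[int], List[bool]]:
    """Insert one register after every token and build per-position targets.

    Assumed layout (hypothetical, for illustration only):
      - the regular token at index i keeps the next-token target tokens[i + 1]
      - the register placed after index i targets tokens[i + horizon]
    Positions whose target falls past the end of the sequence get IGNORE.
    """
    if horizon < 2:
        raise ValueError("horizon must be >= 2; horizon 1 is plain next-token prediction")

    inputs: List[int] = []
    labels: List[int] = []
    is_register: List[bool] = []
    n = len(tokens)

    for i, tok in enumerate(tokens):
        # Regular token: standard next-token objective.
        inputs.append(tok)
        labels.append(tokens[i + 1] if i + 1 < n else IGNORE)
        is_register.append(False)

        # Register token: predicts a target further in the future.
        inputs.append(register_id)
        labels.append(tokens[i + horizon] if i + horizon < n else IGNORE)
        is_register.append(True)

    return inputs, labels, is_register


if __name__ == "__main__":
    inputs, labels, is_register = interleave_registers([10, 11, 12, 13], register_id=99)
    print(inputs)  # [10, 99, 11, 99, 12, 99, 13, 99]
    print(labels)  # [11, 12, 12, 13, 13, -100, -100, -100]
```

In this sketch the regular tokens keep their usual next-token targets, which mirrors the abstract's point that the approach stays aligned with the next-token objective. Changing `horizon` moves the register targets further ahead without touching the model, which is one way to read the claim about scalable prediction horizons.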

Code Implementations: 1 repo