CL AI LG
Mar 11, 2024

The pitfalls of next-token prediction

arXiv:2403.06963v330.6177 citations
h-index: 15
Has Code
ICML

Originality: Highly original

AI Analysis

The paper addresses a foundational problem in language modeling that could affect all of ML/AI, though the evidence is preliminary.

The paper identifies a failure in next-token prediction training for certain tasks, showing that teacher-forcing can fail to learn accurate predictors, and provides preliminary evidence that a multi-token objective resolves this issue.

Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern and correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. Finally, we provide preliminary evidence that this failure can be resolved using _teacherless_ training, a simple modification using dummy tokens that predicts multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available under https://github.com/gregorbachmann/Next-Token-Failures
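To make the contrast between teacher-forced and teacherless training concrete, the sketch below builds the model inputs and labels for a single example under each scheme. It is an illustration only, not the authors' implementation (see their repository for that). The function names, the `"$"` dummy symbol, and the toy tokens are assumptions introduced here. The point it shows is that a teacher-forced model sees the ground-truth answer prefix as input, whereas a teacherless model sees only dummy tokens and so must predict all answer tokens from the problem prefix alone.

```python
# Illustrative sketch only; names and the dummy symbol are hypothetical.
DUMMY = "$"  # assumed placeholder token standing in for answer tokens


def teacher_forced_inputs(prefix, target):
    """Inputs when training with teacher-forcing.

    The model is fed the ground-truth answer tokens (all but the last),
    so each answer token is predicted given the true previous ones.
    """
    return prefix + target[:-1]


def teacherless_inputs(prefix, target, dummy=DUMMY):
    """Inputs when training without a teacher.

    Ground-truth answer tokens in the input are replaced by dummy tokens,
    so the answer cannot be read off the preceding answer tokens.
    """
    return prefix + [dummy] * (len(target) - 1)


def labels(prefix, target, ignore=None):
    """Next-token labels aligned with either input sequence.

    Prefix positions are masked with `ignore`; only answer tokens
    contribute to the loss. Assumes a non-empty prefix.
    """
    return [ignore] * (len(prefix) - 1) + target


prefix = ["a", "b", "="]   # toy problem statement
target = ["x", "y", "z"]   # toy answer

tf_in = teacher_forced_inputs(prefix, target)
tl_in = teacherless_inputs(prefix, target)
lab = labels(prefix, target)

print(tf_in)  # ['a', 'b', '=', 'x', 'y']
print(tl_in)  # ['a', 'b', '=', '$', '$']
print(lab)    # [None, None, 'x', 'y', 'z']
```

Both input sequences share the same labels; only what the model is allowed to condition on differs. At inference time neither scheme has access to ground-truth answer tokens, which is why the abstract treats training and autoregressive inference as distinct phases.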

