Manifold Trajectories in Next-Token Prediction: From Replicator Dynamics to Softmax Equilibrium
The paper provides a theoretical foundation for decoding methods in language models and offers precise insights for practitioners, though its contribution is incremental because it builds on existing variational and dynamical-systems concepts.
The paper formalizes next-token prediction in large language models as a constrained variational principle on the probability simplex, proving that, for a fixed context and temperature, the next-token distribution follows a smooth trajectory converging to the softmax equilibrium. It shows that temperature rescales time along this trajectory, and that top-k and nucleus sampling restrict the flow to a face of the simplex with identical guarantees.
Decoding in large language models is often described as scoring tokens and normalizing with softmax. We give a minimal, self-contained account of this step as a constrained variational principle on the probability simplex. The discrete, normalization-respecting ascent is the classical multiplicative-weights (entropic mirror) update; its continuous-time limit is the replicator flow. From these ingredients we prove that, for a fixed context and temperature, the next-token distribution follows a smooth trajectory inside the simplex and converges to the softmax equilibrium. This formalizes the common "manifold traversal" intuition at the output-distribution level. The analysis yields precise, practice-facing consequences: temperature acts as an exact rescaling of time along the same trajectory, while top-k and nucleus sampling restrict the flow to a face with identical guarantees. We also outline a controlled account of path-dependent score adjustments and their connection to loop-like, hallucination-style behavior. We make no claims about training dynamics or internal representations; those are deferred to future work.
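As a rough numerical illustration of the convergence claim, the sketch below iterates a multiplicative-weights (entropic mirror) update on the simplex and compares the result with softmax. It rests on assumptions not stated in the abstract: the objective is taken to be the entropy-regularized score <s, p> + T H(p), whose maximizer on the simplex is softmax(s / T), and the step size eta, the example scores, and the function names are all hypothetical. The paper's own formulation may differ in detail.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    """Reference softmax(s / T), shifted by the max for numerical stability."""
    z = (scores - scores.max()) / temperature
    w = np.exp(z)
    return w / w.sum()

def entropic_mirror_step(p, scores, temperature=1.0, eta=0.1):
    """One multiplicative-weights step on the assumed objective <s, p> + T * H(p).

    The gradient is s - T * (log p + 1), so the update p * exp(eta * grad)
    reduces, up to normalization, to p**(1 - eta*T) * exp(eta * s).
    Coordinates that start at zero stay at zero.
    """
    w = p ** (1.0 - eta * temperature) * np.exp(eta * (scores - scores.max()))
    return w / w.sum()

def run(p0, scores, temperature=1.0, eta=0.1, steps=300):
    """Iterate the update from p0 and return the final distribution."""
    p = p0.copy()
    for _ in range(steps):
        p = entropic_mirror_step(p, scores, temperature, eta)
    return p

# Hypothetical scores for a five-token vocabulary.
scores = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
T = 0.7

# Full simplex: start from the uniform distribution.
uniform = np.full(scores.size, 1.0 / scores.size)
p_full = run(uniform, scores, temperature=T)

# Top-k: start on the face spanned by the k highest-scoring tokens.
k = 3
mask = np.zeros_like(scores)
mask[np.argsort(scores)[-k:]] = 1.0
p_face = run(mask / mask.sum(), scores, temperature=T)

print(np.abs(p_full - softmax(scores, T)).max())
print(np.abs(p_face[mask > 0] - softmax(scores[mask > 0], T)).max())
```

Under these assumptions the iterate started from the uniform distribution approaches softmax(s / T), while the iterate started on the top-k face never leaves it and approaches the softmax taken over the retained tokens only. In log space the update is a linear contraction with factor 1 - eta*T, so the sketch needs 0 < eta*T < 1 to behave as described.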