LG CL ML | Feb 28, 2024

Implicit Optimization Bias of Next-Token Prediction in Linear Models

arXiv:2402.18551v2 | 15.716 citations | h-index: 1 | NIPS

Originality: Incremental advance

AI Analysis

This work addresses the foundational understanding of training dynamics for language models. It is incremental in that it extends prior research on implicit bias to the NTP setting.

The paper investigates the optimization bias of next-token prediction (NTP) in linear models, showing that gradient descent selects parameters that equate logit differences to log-odds in the data subspace and diverge in norm in the orthogonal subspace to maximize an NTP-specific margin.

We initiate an investigation into the optimization properties of next-token prediction (NTP), the dominant training paradigm for modern language models. Specifically, we study the structural properties of the solutions selected by gradient-based optimizers among the many possible minimizers of the NTP objective. By framing NTP as cross-entropy minimization across distinct contexts, each tied with a sparse conditional probability distribution across a finite vocabulary of tokens, we introduce "NTP-separability conditions" that enable reaching the data-entropy lower bound. With this setup, and focusing on linear models with fixed context embeddings, we characterize the optimization bias of gradient descent (GD): Within the data subspace defined by the sparsity patterns of distinct contexts, GD selects parameters that equate the logits' differences of in-support tokens to their log-odds. In the orthogonal subspace, the GD parameters diverge in norm and select the direction that maximizes a margin specific to NTP. These findings extend previous research on implicit bias in one-hot classification to the NTP setting, highlighting key differences and prompting further research into the optimization and generalization properties of NTP, irrespective of the specific architecture used to generate the context embeddings.
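The behavior described in the abstract can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' code: the three contexts, the four-token vocabulary, the distributions `P`, the frequencies `pi`, the orthonormal embeddings `H`, the step size, and the step count are all hypothetical choices. The embeddings are orthonormal so that the contexts are linearly independent, which is assumed here to make the toy data NTP-separable. It runs plain gradient descent on the NTP cross-entropy of a linear model and then compares in-support logit differences with the log-odds, the loss with the data entropy, and the parameter norm at two points in training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one row per distinct context, one column per token.
P = np.array([[0.7, 0.3, 0.0, 0.0],
              [0.0, 0.5, 0.3, 0.2],
              [0.6, 0.0, 0.0, 0.4]])   # sparse next-token distributions
pi = np.array([0.5, 0.3, 0.2])          # empirical context frequencies
m, V = P.shape
d = 5                                   # embedding dimension (d > m)

# Fixed context embeddings with orthonormal rows (an assumption of this sketch).
H = np.linalg.qr(rng.standard_normal((d, m)))[0].T   # shape (m, d)

def log_softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def ntp_loss(W):
    # Cross-entropy between each context's sparse distribution and the model.
    return -np.sum(pi[:, None] * P * log_softmax(H @ W.T))

# Data-entropy lower bound; zero-probability entries contribute nothing.
safe_P = np.where(P > 0, P, 1.0)
entropy = -np.sum(pi[:, None] * P * np.log(safe_P))

# Plain gradient descent from zero initialization.
W = np.zeros((V, d))
lr, steps = 1.0, 20_000
norm_mid = None
for t in range(steps):
    S = np.exp(log_softmax(H @ W.T))            # model probabilities, (m, V)
    W -= lr * ((S - P) * pi[:, None]).T @ H     # gradient of ntp_loss w.r.t. W
    if t == steps // 2:
        norm_mid = np.linalg.norm(W)

# Compare in-support logit differences with the corresponding log-odds.
logits = H @ W.T
max_err = 0.0
for j in range(m):
    supp = np.flatnonzero(P[j])
    learned = logits[j, supp] - logits[j, supp[0]]
    target = np.log(P[j, supp] / P[j, supp[0]])
    max_err = max(max_err, float(np.abs(learned - target).max()))

print("max |logit difference - log-odds|:", max_err)
print("loss - entropy:", ntp_loss(W) - entropy)
print("||W|| halfway vs. at the end:", norm_mid, np.linalg.norm(W))
```

In this sketch the in-support logit differences should settle near the log-odds while the norm of `W` keeps growing, because pushing out-of-support logits further down always lowers the loss slightly. The direction of that growth, which the abstract characterizes as maximizing an NTP-specific margin, is not examined here.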
