A Statistical Physics of Language Model Reasoning
This work provides a theoretical tool for studying and predicting reasoning failures in language models. The contribution is incremental in that it builds on existing efforts at mechanistic understanding.
The authors address the problem of understanding emergent reasoning in transformer language models. They propose a statistical physics framework that models reasoning dynamics as a stochastic dynamical system on a lower-dimensional manifold, find that a rank-40 projection explains about 50% of the variance, and identify four latent reasoning regimes.
Transformer LMs show emergent reasoning that resists mechanistic understanding. We offer a statistical physics framework for continuous-time chain-of-thought reasoning dynamics. We model sentence-level hidden state trajectories as a stochastic dynamical system on a lower-dimensional manifold. This drift-diffusion system uses latent regime switching to capture diverse reasoning phases, including misaligned states or failures. Empirical trajectories (8 models, 7 benchmarks) show that a rank-40 projection (balancing variance capture and feasibility) explains ~50% of the variance. We find four latent reasoning regimes. A switching linear dynamical system (SLDS) model is formulated and validated to capture these features. The framework enables low-cost reasoning simulation, offering tools to study and predict critical transitions like misaligned states or other LM failures.
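To make the regime-switching drift-diffusion idea concrete, the following is a minimal discrete-time sketch of a switching linear dynamical system (SLDS) in a 40-dimensional latent space with four regimes. It is an illustration under stated assumptions, not the authors' fitted model: the function name `simulate_slds`, the transition probabilities, the drift matrices, and the noise scale are all hypothetical placeholders chosen only to produce bounded trajectories.

```python
import numpy as np

def simulate_slds(n_steps, dim=40, n_regimes=4, stay_prob=0.95,
                  noise_scale=0.1, seed=0):
    """Simulate a toy SLDS; every parameter here is an illustrative assumption."""
    rng = np.random.default_rng(seed)

    # Markov transition matrix over regimes: stay with probability stay_prob,
    # otherwise jump uniformly to one of the other regimes.
    off_diag = (1.0 - stay_prob) / (n_regimes - 1)
    P = np.full((n_regimes, n_regimes), off_diag)
    np.fill_diagonal(P, stay_prob)

    # Per-regime linear drift: h_t = A_z h_{t-1} + b_z + noise.
    # Each A_z is 0.9 times a random orthogonal matrix, so it is a contraction
    # and the simulated trajectories stay bounded.
    A = np.empty((n_regimes, dim, dim))
    for k in range(n_regimes):
        Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
        A[k] = 0.9 * Q
    b = rng.standard_normal((n_regimes, dim))

    z = np.empty(n_steps, dtype=int)   # latent regime at each sentence step
    h = np.empty((n_steps, dim))       # projected hidden state at each step
    z[0] = rng.integers(n_regimes)
    h[0] = rng.standard_normal(dim)
    for t in range(1, n_steps):
        z[t] = rng.choice(n_regimes, p=P[z[t - 1]])               # regime switch
        drift = A[z[t]] @ h[t - 1] + b[z[t]]                      # deterministic drift
        h[t] = drift + noise_scale * rng.standard_normal(dim)     # diffusion term
    return z, h

# Example: one simulated trajectory of 200 sentence-level steps.
regimes, states = simulate_slds(200)
```

In this sketch each step plays the role of one sentence in a chain of thought, and `regimes` records which latent reasoning phase is active. Fitting the transition matrix and per-regime dynamics to real hidden-state trajectories is outside the scope of the example.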