A Stochastic Dynamical Theory of LLM Self-Adversariality: Modeling Severity Drift as a Critical Process
The paper addresses bias propagation in LLMs, a problem relevant to AI safety and formal verification, but the contribution is incremental because it builds on existing dynamical systems theory.
The paper tackles the problem of LLMs self-amplifying biases or toxicity through chain-of-thought reasoning by modeling the amplification as a stochastic dynamical process. It shows that certain parameter regimes cause phase transitions from self-correcting to runaway severity, and it derives stationary distributions and scaling laws.
This paper introduces a continuous-time stochastic dynamical framework for understanding how large language models (LLMs) may self-amplify latent biases or toxicity through their own chain-of-thought reasoning. The model posits an instantaneous "severity" variable $x(t) \in [0,1]$ evolving under a stochastic differential equation (SDE) with a drift term $\mu(x)$ and a diffusion term $\sigma(x)$. Crucially, such a process can be consistently analyzed via the Fokker--Planck approach if each incremental step is approximately Markovian in severity space. The analysis investigates critical phenomena, showing that certain parameter regimes create phase transitions from subcritical (self-correcting) to supercritical (runaway severity). The paper derives stationary distributions, first-passage times to harmful thresholds, and scaling laws near critical points. Finally, it highlights implications for agents and extended LLM reasoning models: in principle, these equations might serve as a basis for formal verification of whether a model remains stable or propagates bias over repeated inferences.
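To make the subcritical/supercritical distinction concrete, the sketch below integrates an SDE of the Itô form $dx = \mu(x)\,dt + \sigma(x)\,dW$ with the Euler-Maruyama scheme and records first-passage times to a threshold. The abstract does not specify $\mu$ or $\sigma$, so the logistic-style drift `alpha*x*(1-x) - beta*x`, the noise `s*x*(1-x)`, and every parameter value here are illustrative assumptions rather than the paper's model. Under these assumed forms, `alpha < beta` plays the role of the self-correcting regime and `alpha > beta` the runaway regime, with a deterministic fixed point at `1 - beta/alpha`.

```python
import math
import random


def drift(x, alpha, beta):
    # Hypothetical drift (not from the paper): amplification alpha*x*(1-x)
    # competing with a linear self-correction term beta*x.
    return alpha * x * (1.0 - x) - beta * x


def diffusion(x, s):
    # Hypothetical multiplicative noise that vanishes at x = 0 and x = 1.
    return s * x * (1.0 - x)


def simulate_path(alpha, beta, s, x0, dt, n_steps, threshold, rng):
    """Euler-Maruyama integration of dx = mu(x) dt + sigma(x) dW on [0, 1].

    Returns the final severity and the first time the path reaches
    `threshold` (None if it never does within the simulated horizon).
    """
    x = x0
    hit_time = None
    sqrt_dt = math.sqrt(dt)
    for k in range(n_steps):
        dw = rng.gauss(0.0, sqrt_dt)  # Brownian increment, variance dt
        x += drift(x, alpha, beta) * dt + diffusion(x, s) * dw
        x = min(1.0, max(0.0, x))  # keep severity inside [0, 1]
        if hit_time is None and x >= threshold:
            hit_time = (k + 1) * dt
    return x, hit_time


def run_ensemble(alpha, beta, s=0.3, x0=0.1, dt=0.01, n_steps=2000,
                 n_paths=200, threshold=0.5, seed=0):
    """Simulate many paths; return (mean final severity, fraction of paths
    that reached the threshold, mean first-passage time or None)."""
    rng = random.Random(seed)
    finals, hits = [], []
    for _ in range(n_paths):
        x, t = simulate_path(alpha, beta, s, x0, dt, n_steps, threshold, rng)
        finals.append(x)
        if t is not None:
            hits.append(t)
    mean_final = sum(finals) / n_paths
    hit_fraction = len(hits) / n_paths
    mean_hit = sum(hits) / len(hits) if hits else None
    return mean_final, hit_fraction, mean_hit


if __name__ == "__main__":
    # Assumed subcritical regime (alpha < beta): severity decays toward 0.
    print("subcritical  :", run_ensemble(alpha=0.5, beta=1.0))
    # Assumed supercritical regime (alpha > beta): severity settles near
    # the deterministic fixed point 1 - beta/alpha = 0.75.
    print("supercritical:", run_ensemble(alpha=2.0, beta=0.5))
```

With these assumed forms the ensemble should decay toward zero severity when `alpha < beta` and cluster near `1 - beta/alpha` when `alpha > beta`. An empirical first-passage time of this kind could be compared against whatever analytical expressions the paper derives.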