The Shape of Wisdom: Decision Trajectories in Language Models
For researchers studying language model interpretability, this paper provides a reproducible framework for distinguishing stable correct answers from fragile ones. The findings are incremental and descriptive; they do not amount to a full mechanistic explanation.
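The paper analyzes how answer scores evolve across transformer layers in language models and finds that correctness and stability are distinct: the largest group of trajectories is unstable-correct, not stable-correct. It introduces a three-quantity description of decision trajectories and reports, on a traced subset, that in stable-correct cases the average attention contribution points toward the correct answer while the average MLP contribution does not. Span deletion experiments show that removing answer-supporting text lowers the margin and removing distractor-like text raises it.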
Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer-supporting text hurts the margin and removing distractor-like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.
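To make the three trajectory quantities concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes that the margin is the correct option's score minus the best competing score at each layer, that the "distance from a decision flip" is the absolute margin, and that a trajectory counts as stable when the sign of the margin is unchanged over the last `tail` layers. The function names, the `tail` parameter, and the toy scores are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def margins(scores: Sequence[Sequence[float]], correct: int) -> List[float]:
    """Per-layer margin: correct option's score minus the best other score.

    Assumed definition; the source does not spell out the exact formula.
    """
    out = []
    for layer in scores:
        best_other = max(s for i, s in enumerate(layer) if i != correct)
        out.append(layer[correct] - best_other)
    return out


def drifts(m: Sequence[float]) -> List[float]:
    """Next-layer change in the margin (one value per adjacent layer pair)."""
    return [b - a for a, b in zip(m, m[1:])]


def flip_distances(m: Sequence[float]) -> List[float]:
    """Assumed proxy: how far the margin is from zero, where correctness flips."""
    return [abs(x) for x in m]


def label(m: Sequence[float], tail: int = 3) -> str:
    """Hypothetical labeling rule: stable if the margin's sign matches the
    final sign on each of the last `tail` layers."""
    final_correct = m[-1] > 0
    settled = all((x > 0) == final_correct for x in m[-tail:])
    stability = "stable" if settled else "unstable"
    outcome = "correct" if final_correct else "wrong"
    return f"{stability}-{outcome}"


# Toy trajectory: 4 layers, 4 answer options, option 0 is correct.
scores = [
    [0.0, 1.0, 0.5, 0.2],
    [1.0, 1.5, 0.5, 0.2],
    [2.0, 1.5, 0.5, 0.2],
    [3.0, 1.0, 0.5, 0.2],
]
m = margins(scores, correct=0)
d = drifts(m)
f = flip_distances(m)
```

In this toy case the final answer is correct, but the margin only turns positive at the third layer, so `label(m, tail=3)` returns "unstable-correct" while `label(m, tail=2)` returns "stable-correct". The labeling is sensitive to the window, which is one reason to treat the threshold as an assumption rather than a property of the method.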