Representation Without Control: Testing the Realization Effect in Language Models
For researchers using LLMs as behavioral simulators, this work shows that behavioral sensitivity and internal representation do not guarantee causal control, highlighting the need for multi-level evaluation.
The paper tests whether LLMs exhibit the realization effect from behavioral economics. Prompt-only responses are sensitive to condition, but the direction of the effect does not match what the human realization effect predicts. Realization status can be linearly decoded from internal representations, yet causal steering along the decoded direction fails to shift risk choices, indicating that latent readout does not imply behavioral reliance.
Large language models are increasingly used as behavioral simulators, but it remains unclear when their outputs reflect human-like cognitive mechanisms rather than prompt-sensitive surface patterns. We study this question through the realization effect, a well-characterized finding in behavioral economics in which risk-taking differs systematically after paper versus realized gains and losses. We evaluate LLM behavior at three levels: prompt-only behavioral sensitivity, linear readout of internal representations, and causal control via activation steering. Prompt-only results show systematic condition sensitivity, but the directional pattern does not reproduce human realization-effect predictions. Gemma's residual stream contains a linearly decodable realization-status signal at layer 18 that generalizes to held-out prompts. Steering along this direction does not, however, reliably shift downstream risk choices, a null result that holds across positive steering scales and in a negative sign-symmetry run. We conclude that behavioral sensitivity, latent readout, and causal control are three distinct properties that do not automatically co-occur. In particular, successful latent readout is insufficient evidence that a model behaviorally relies on a representation during downstream decision-making.
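To make the distinction between latent readout and causal control concrete, the following is a minimal toy sketch on synthetic vectors, not the pipeline used in this work. Everything in it is an assumption made for illustration: the 8-dimensional "activations", the difference-of-means probe, the steering scale of 8, and above all the hypothetical downstream choice head, which is constructed to be orthogonal to the probe direction so that the dissociation holds by design. It shows only that a direction can be decodable on held-out data while additive steering along it, in either sign, leaves a downstream choice unchanged.

```python
import random

random.seed(0)
DIM = 8  # hypothetical hidden size, far smaller than a real residual stream


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_activation(realized):
    """Synthetic activation with realization status planted along axis 0."""
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    vec[0] += 3.0 if realized else -3.0
    return vec


def make_split(n_per_class):
    return [(make_activation(label), label)
            for label in (0, 1) for _ in range(n_per_class)]


def class_mean(data, label):
    rows = [x for x, y in data if y == label]
    return [sum(col) / len(rows) for col in zip(*rows)]


train = make_split(200)
held_out = make_split(100)

# Readout: unit-norm difference-of-means direction fitted on the training split.
mean1, mean0 = class_mean(train, 1), class_mean(train, 0)
diff = [a - b for a, b in zip(mean1, mean0)]
norm = dot(diff, diff) ** 0.5
direction = [v / norm for v in diff]
# Decision threshold at the midpoint of the two projected class means.
threshold = 0.5 * (dot(mean1, direction) + dot(mean0, direction))


def probe(x):
    return int(dot(x, direction) > threshold)


probe_accuracy = sum(probe(x) == y for x, y in held_out) / len(held_out)

# Toy downstream "risk choice" head (an assumption): axis 1 with its component
# along the probe direction removed, so it ignores that direction by design.
basis = [0.0] * DIM
basis[1] = 1.0
overlap = dot(basis, direction)
choice_weights = [b - overlap * d for b, d in zip(basis, direction)]


def risky_rate(data, alpha=0.0):
    """Share of 'risky' choices after adding alpha * direction to each vector."""
    steered = ([h + alpha * d for h, d in zip(x, direction)] for x, _ in data)
    return sum(dot(s, choice_weights) > 0 for s in steered) / len(data)


baseline = risky_rate(held_out)
steered_pos = risky_rate(held_out, alpha=8.0)   # positive steering scale
steered_neg = risky_rate(held_out, alpha=-8.0)  # sign-symmetry check

print(f"held-out probe accuracy: {probe_accuracy:.3f}")
print(f"risky rate: baseline={baseline:.3f} "
      f"pos={steered_pos:.3f} neg={steered_neg:.3f}")
```

In a real model the downstream computation is not known in advance, which is why the steering test has to be run empirically rather than assumed from probe accuracy.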