Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models
This work addresses a critical gap in evaluating and aligning language models for safety, particularly on Theory of Mind tasks, though its contribution is incremental in that it probes existing models rather than proposing new ones.
The paper tackles the problem of understanding how language models internally represent mental states such as beliefs. It finds that model size and fine-tuning improve these representations, which are structured yet brittle, and that targeted activation edits can correct wrong inferences.
Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent mental states of self and others. Understanding these internal mechanisms is critical, not only to move beyond surface-level performance, but also for model alignment and safety, where subtle misattributions of mental states may go undetected in generated outputs. In this work, we present the first systematic investigation of belief representations in LMs by probing models across different scales, training regimens, and prompts, using control tasks to rule out confounds. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs' internal representations of others' beliefs, which are structured (not mere by-products of spurious correlations) yet brittle to prompt variations. Crucially, we show that these representations can be strengthened: targeted edits to model activations can correct wrong ToM inferences.
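To make the three ingredients named above concrete (a probe, a control task, and an activation edit), the following is a minimal sketch on synthetic data. It is not the paper's procedure: the "activations" are random vectors with a planted belief direction, the probe is a least-squares linear classifier, the control task simply assigns random labels, and the edit shifts each activation along the probe direction. All function names and parameters here are hypothetical and chosen for illustration only.

```python
import numpy as np

def make_synthetic_activations(n=400, dim=32, seed=0):
    """Stand-in for LM hidden states: Gaussian noise plus a planted
    'belief' direction whose sign encodes a binary belief label."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    acts = rng.normal(size=(n, dim)) + 3.0 * np.outer(2 * labels - 1, direction)
    return acts, labels

def fit_linear_probe(acts, labels):
    """Least-squares linear probe; the last weight is the bias."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    y = 2.0 * labels - 1.0  # map {0, 1} to {-1, +1}
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def probe_predict(w, acts):
    X = np.hstack([acts, np.ones((len(acts), 1))])
    return (X @ w > 0).astype(int)

def accuracy(w, acts, labels):
    return float(np.mean(probe_predict(w, acts) == labels))

def selectivity(acts, labels, n_train=300, seed=1):
    """Compare held-out probe accuracy on real labels against a control
    task whose labels are random, hence unrelated to the activations."""
    rng = np.random.default_rng(seed)
    control = rng.integers(0, 2, size=len(labels))
    tr, te = slice(0, n_train), slice(n_train, None)
    w_real = fit_linear_probe(acts[tr], labels[tr])
    w_ctrl = fit_linear_probe(acts[tr], control[tr])
    real_acc = accuracy(w_real, acts[te], labels[te])
    ctrl_acc = accuracy(w_ctrl, acts[te], control[te])
    return real_acc, ctrl_acc, w_real

def steer(w, acts, target, margin=1.0):
    """Shift each activation along the probe direction so that the
    probe score equals +margin (target=1) or -margin (target=0)."""
    direction = w[:-1]
    score = acts @ direction + w[-1]
    desired = margin * (2.0 * target - 1.0)
    shift = (desired - score) / (direction @ direction)
    return acts + np.outer(shift, direction)

acts, labels = make_synthetic_activations()
real_acc, ctrl_acc, w = selectivity(acts, labels)
print(f"probe accuracy: {real_acc:.2f}, control accuracy: {ctrl_acc:.2f}")
steered = steer(w, acts, target=1)
print("all steered to belief=1:", bool(np.all(probe_predict(w, steered) == 1)))
```

A large gap between probe accuracy and control accuracy suggests the probe is reading structure in the activations rather than memorizing labels. With a real LM, the synthetic activations would be replaced by hidden states extracted at a chosen layer, and the steered vectors would be written back during the forward pass; how that is done depends on the model and tooling and is not specified here.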