CL, AI | Mar 30

Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

Junsol Kim, Winnie Street, Roberta Rocca, Diane M. Korngiebel, Adam Waytz, James Evans, Geoff Keeling

arXiv:2603.28925 | 81.4 | h-index: 41

AI Analysis

This work addresses an AI safety concern: suppressing potentially harmful mind-attributions in LLMs does not impair socio-cognitive abilities such as Theory of Mind, though it also suppresses widely shared perspectives on non-human minds.

The study investigated whether safety fine-tuning in LLMs, which suppresses mind-attribution tendencies such as claims of consciousness, degrades Theory of Mind (ToM) abilities. It found that the two are dissociable, but that safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief.

Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.

