Small LLMs Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning
This work addresses the problem of instilling human-like social intelligence in AI models, showing the limitations of current RL methods for small LLMs.
The paper investigates whether small-scale LLMs can develop a generalizable Theory of Mind (ToM) capability through reinforcement learning with verifiable rewards, finding that while in-distribution performance improves, the models fail to transfer to unseen ToM tasks, indicating narrow overfitting rather than true ToM acquisition.
Recent advancements in large language models (LLMs) have demonstrated emergent capabilities in complex reasoning, largely spurred by rule-based Reinforcement Learning (RL) techniques applied during post-training. This has raised the question of whether similar methods can instill more nuanced, human-like social intelligence, such as a Theory of Mind (ToM), in LLMs. This paper investigates whether small-scale LLMs can acquire a robust and generalizable ToM capability through RL with verifiable rewards (RLVR). We conduct a systematic evaluation by training models on various combinations of prominent ToM datasets (HiToM, ExploreToM, FANToM) and testing for generalization on held-out datasets (e.g., OpenToM). Our findings indicate that small LLMs struggle to develop a generic ToM capability. While performance on in-distribution tasks improves, this capability fails to transfer to unseen ToM tasks with different characteristics. Furthermore, we demonstrate that prolonged RL training leads to models "hacking" the statistical patterns of the training datasets, resulting in significant performance gains on in-domain data but no change, or even a degradation, in performance on out-of-distribution tasks. This suggests the learned behavior is a form of narrow overfitting rather than the acquisition of a true, abstract ToM capability.
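The evaluation protocol described above (train on combinations of ToM datasets, test on held-out ones, score with a verifiable reward) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function names `training_splits` and `exact_match_reward` are hypothetical, enumerating every non-empty subset of the training pool is an assumption (the abstract says only "various combinations"), treating unselected pool datasets as out-of-distribution is an assumption, and exact-match scoring is just one simple form a rule-based verifiable reward might take.

```python
from itertools import combinations

# Dataset names come from the abstract; everything else is assumed.
TRAIN_POOL = ["HiToM", "ExploreToM", "FANToM"]
HELD_OUT = ["OpenToM"]


def training_splits(pool, held_out):
    """Enumerate every non-empty subset of `pool` as a training mix.

    Assumption: datasets not chosen for training are treated as
    out-of-distribution, alongside the always-held-out datasets.
    """
    splits = []
    for k in range(1, len(pool) + 1):
        for combo in combinations(pool, k):
            ood = [d for d in pool if d not in combo] + list(held_out)
            splits.append({"train": list(combo), "ood": ood})
    return splits


def exact_match_reward(prediction: str, gold: str) -> float:
    """Hypothetical verifiable reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


if __name__ == "__main__":
    for split in training_splits(TRAIN_POOL, HELD_OUT):
        print("train:", split["train"], "| evaluate OOD on:", split["ood"])
```

Comparing in-distribution accuracy against accuracy on each `ood` list over the course of training is the kind of measurement that would expose the gap the abstract reports: in-domain gains without matching out-of-distribution gains.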