UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs
This work addresses the problem of assessing and enhancing Theory of Mind (ToM) capabilities in LLMs, a problem of interest to AI researchers. Its contribution is incremental, since it builds on existing work such as SimToM and TOMBENCH.
The paper tackles the challenge of improving Theory of Mind (ToM) in large language models by introducing UniToMBench, a unified benchmark that integrates multi-interaction tasks and evolving story scenarios. In its evaluation, models such as GPT-4o usually exceed 80% accuracy on emotional and belief-related tasks but vary considerably on knowledge-based tasks.
Theory of Mind (ToM), the ability to understand the mental states of oneself and others, remains a challenging area for large language models (LLMs), which often fail to predict human mental states accurately. In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM and TOMBENCH to systematically improve and assess ToM capabilities in LLMs through multi-interaction task designs and evolving story scenarios. Supported by a custom dataset of over 1,000 hand-written scenarios, UniToMBench combines perspective-taking techniques with diverse evaluation metrics to better stimulate social cognition in LLMs. In our evaluation, models such as GPT-4o and GPT-4o Mini show consistently high accuracy on emotional and belief-related scenarios, usually above 80%, but their performance varies significantly across knowledge-based tasks. These results highlight both the strengths and limitations of current LLMs in ToM-related tasks, underscoring the value of UniToMBench as a comprehensive tool for future development. Our code is publicly available at https://github.com/Shamant/unifiedtombenchmark.
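The abstract reports accuracy separately for emotional, belief-related, and knowledge-based tasks. A minimal sketch of how such a per-category breakdown could be computed is shown below. It is an illustration only and is not taken from the UniToMBench repository: the function name `accuracy_by_category`, the `(category, predicted, gold)` record layout, and the toy records are all assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: fraction of correct answers}.

    `records` is an iterable of (category, predicted, gold) triples.
    This layout is hypothetical, not the benchmark's actual schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        if predicted == gold:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy, made-up records for illustration; these are not benchmark results.
records = [
    ("emotion", "A", "A"),
    ("emotion", "B", "B"),
    ("belief", "C", "C"),
    ("belief", "A", "D"),
    ("knowledge", "B", "A"),
    ("knowledge", "B", "B"),
]

scores = accuracy_by_category(records)
for category, score in scores.items():
    print(f"{category}: {score:.0%}")
```

Reporting one score per category, rather than a single pooled accuracy, is what makes the kind of contrast described in the abstract visible: a strong result on emotional and belief-related items would otherwise mask weaker or less stable results on knowledge-based items.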