CL | AI | Jun 11, 2025

UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs

arXiv:2506.09450v1 | 1 citation | h-index: 1 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses how to assess and enhance ToM capabilities in LLMs, a problem of interest to AI researchers. It is incremental, building on prior work such as SimToM and TOMBENCH.

The paper aims to improve Theory of Mind (ToM) in large language models by introducing UniToMBench, a unified benchmark that integrates multi-interaction tasks and evolving story scenarios. In the reported evaluation, models such as GPT-4o usually score above 80% accuracy on emotion- and belief-related tasks but vary considerably on knowledge-based tasks.

Theory of Mind (ToM), the ability to understand the mental states of oneself and others, remains a challenging area for large language models (LLMs), which often fail to predict human mental states accurately. In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM and TOMBENCH to systematically improve and assess ToM capabilities in LLMs by integrating multi-interaction task designs and evolving story scenarios. Supported by a custom dataset of over 1,000 hand-written scenarios, UniToMBench combines perspective-taking techniques with diverse evaluation metrics to better stimulate social cognition in LLMs. Through evaluation, we observe that while models like GPT-4o and GPT-4o Mini show consistently high accuracy in tasks involving emotional and belief-related scenarios, with results usually above 80%, there is significant variability in their performance across knowledge-based tasks. These results highlight both the strengths and limitations of current LLMs in ToM-related tasks, underscoring the value of UniToMBench as a comprehensive tool for future development. Our code is publicly available here: https://github.com/Shamant/unifiedtombenchmark.

Code Implementations: 1 repo
