CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?
This work addresses the need for better benchmarks to validate LLMs' social reasoning capabilities, which matter for effective and natural interactions in real-world applications. The contribution is incremental in that it builds on existing ToM evaluation methods.
The authors address the evaluation of Theory of Mind (ToM) in Large Language Models (LLMs) by proposing CoMMET, a new multimodal benchmark dataset that covers a broader range of mental states and introduces multi-turn testing. They assess current models comprehensively and analyze their strengths and limitations.
Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.