CL, AI | Jan 3, 2025

The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters

Chulun Zhou, Qiujing Wang, Mo Yu, Xiaoqian Yue, Rui Lu, Jiangnan Li, Yifan Zhou, Shunchi Zhang, Jie Zhou, Wai Lam

arXiv:2501.01705v2 | 14.79 citations | h-index: 11 | ACL

Originality: Incremental advance

AI Analysis

This work addresses the challenge of assessing nuanced contextual understanding in AI psychological reasoning, but the advance is incremental: it builds on existing ToM evaluation by adding a new benchmark.

The study examines how Theory-of-Mind (ToM) capabilities are evaluated in machines and argues that existing benchmarks overlook characters' personal background context. It introduces the CharToM benchmark, with 1,035 questions based on characters from classic novels. Human participants perform dramatically better when they have read the novels than when they have not, while state-of-the-art LLMs such as o1 and DeepSeek-R1 still perform notably worse than humans, despite having seen these stories during pre-training.

Theory-of-Mind (ToM) is a fundamental psychological capability that allows humans to understand and interpret the mental states of others. Humans infer others' thoughts by integrating causal cues and indirect clues from broad contextual information, often derived from past interactions. In other words, human ToM relies heavily on an understanding of the backgrounds and life stories of others. Unfortunately, this aspect is largely overlooked in existing benchmarks for evaluating machines' ToM capabilities, because they use short narratives that lack global context, especially the personal backgrounds of characters. In this paper, we verify the importance of comprehensive contextual understanding of personal backgrounds in ToM and assess the performance of LLMs in such complex scenarios. To achieve this, we introduce the CharToM benchmark, comprising 1,035 ToM questions based on characters from classic novels. Our human study reveals a significant disparity in performance: the same group of educated participants performs dramatically better when they have read the novels than when they have not. In parallel, our experiments on state-of-the-art LLMs, including the very recent o1 and DeepSeek-R1 models, show that LLMs still perform notably worse than humans, even though they have seen these stories during pre-training. This highlights the limitations of current LLMs in capturing the nuanced contextual information required for ToM reasoning.
