DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim

arXiv:2604.20443

AI Analysis

This work addresses a gap in how LLM reasoning is evaluated, relevant to both AI and cognitive science, though the contribution is incremental because it builds on existing ToM benchmarks.

The paper asks whether LLMs' Theory of Mind abilities stem from robust reasoning or from spurious correlations, and introduces DialToM, a benchmark for forecasting state-driven dialogue trajectories, to test this. Results show that while LLMs excel at identifying mental states, most fail to use that understanding to forecast social trajectories, and LLM-generated inferences show only weak semantic similarity to human inferences.

Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting -- probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.
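To make the reported asymmetry concrete, the following is a minimal sketch of how per-task multiple-choice accuracy and the Literal/Functional gap could be computed. It is not the authors' evaluation code: the record layout (`task`, `predicted`, `gold`), the function name `accuracy_by_task`, and the toy records are all hypothetical and chosen only for illustration; the actual format is defined in the DialToM repository.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy} for multiple-choice records.

    Each record is a dict with hypothetical keys:
      'task'      - e.g. 'literal' or 'functional'
      'predicted' - the option label chosen by the model
      'gold'      - the human-verified correct option label
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        correct[r["task"]] += int(r["predicted"] == r["gold"])
    return {task: correct[task] / total[task] for task in total}


# Toy, made-up records (not benchmark data).
records = [
    {"task": "literal", "predicted": "A", "gold": "A"},
    {"task": "literal", "predicted": "C", "gold": "C"},
    {"task": "functional", "predicted": "B", "gold": "D"},
    {"task": "functional", "predicted": "A", "gold": "A"},
]

scores = accuracy_by_task(records)
# A positive gap means the model is better at identifying mental states
# than at picking the state-consistent dialogue trajectory.
gap = scores["literal"] - scores["functional"]
print(scores, gap)
```

Reporting the two accuracies side by side, rather than a single pooled score, is what lets a gap between mental-state identification and trajectory forecasting show up at all.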
