CL AI MA Jun 26, 2025

Theory of Mind in Action: The Instruction Inference Task

Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh

arXiv:2507.02935v1 2.7 h-index: 16

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of enabling effective human-AI collaboration through ToM reasoning, though the advance is incremental because it builds on existing LLM methods.

The paper tackles the problem of assessing Theory of Mind (ToM) in collaborative environments by introducing the Instruction Inference task, in which an agent interprets ambiguous instructions to assist a principal. It finds that Tomcat, the authors' LLM-based agent, achieves performance comparable to human participants when using few-shot chain-of-thought reasoning.

Theory of Mind (ToM) refers to an agent's capacity to infer the mental states of other agents. ToM is essential for effective collaboration. To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting indirect or ambiguous instructions. We present Tomcat, an LLM-based agent designed to exhibit ToM reasoning in interpreting and responding to the principal's instructions. We implement two variants of Tomcat. The first, dubbed Fs-CoT, is based on a small number of examples (i.e., few-shot or Fs) demonstrating the requisite structured reasoning (i.e., chain-of-thought or CoT). The second, dubbed CP, relies on commonsense knowledge and information about the problem (i.e., commonsense prompt or CP). We realized both variants of Tomcat on three leading large language models (LLMs), namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants in which we provided participants with the same information as the CP variant of Tomcat. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to that of the human participants, underscoring its ToM potential for human-AI collaboration.
