CL AI | Apr 21, 2024

NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding

Chunkit Chan, Cheng Jiayang, Yauwai Yim, Zheye Deng, Wei Fan, Haoran Li, Xin Liu, Hongming Zhang, Weiqi Wang, Yangqiu Song

arXiv:2404.13627v3 | 21.956 citations | h-index: 17 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the need for better benchmarks to assess ToM in AI for real-world applications such as negotiation, though the advance is incremental because it builds on existing BDI theory.

The authors tackled the problem of evaluating machine Theory of Mind (ToM) in real-world human interaction scenarios by introducing NegotiationToM, a benchmark for stress-testing machine ToM in negotiation settings. They found that state-of-the-art LLMs perform significantly worse than humans, even with chain-of-thought methods.

Large Language Models (LLMs) have sparked substantial interest and debate about whether they exhibit emergent Theory of Mind (ToM) ability. Theory of mind evaluations currently focus on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations, and so lack evaluation of machine ToM ability in real-world human interaction scenarios. This creates a pressing demand for new benchmarks built on real-world scenarios. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation settings, covering multi-dimensional mental states (i.e., desires, beliefs, and intentions). Our benchmark builds upon the Belief-Desire-Intention (BDI) agent modeling theory, and we conduct the necessary empirical experiments to evaluate large language models. Our findings demonstrate that NegotiationToM is challenging for state-of-the-art LLMs, as they consistently perform significantly worse than humans, even when employing the chain-of-thought (CoT) method.
