CL | AI | Apr 21, 2024

NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding

arXiv:2404.13627v3 | 52 citations | h-index: 17 | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the need for better benchmarks to assess ToM in AI for real-world applications such as negotiation, though the advance is incremental because it builds on existing BDI theory.

The authors tackled the problem of evaluating machine Theory of Mind (ToM) in real-world human interaction scenarios by introducing NegotiationToM, a benchmark for stress-testing ToM in negotiation settings. They found that state-of-the-art LLMs perform significantly worse than humans, even with chain-of-thought methods.

Large Language Models (LLMs) have sparked substantial interest and debate concerning the potential emergence of Theory of Mind (ToM) ability. Current ToM evaluations focus on testing models with machine-generated data or game settings that are prone to shortcuts and spurious correlations, and so fail to evaluate machine ToM ability in real-world human interaction scenarios. This creates a pressing demand for new real-world scenario benchmarks. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation settings covering multi-dimensional mental states (i.e., desires, beliefs, and intentions). Our benchmark builds on the Belief-Desire-Intention (BDI) agent modeling theory, and we conduct empirical experiments to evaluate large language models on it. Our findings demonstrate that NegotiationToM is challenging for state-of-the-art LLMs, which consistently perform significantly worse than humans, even when employing the chain-of-thought (CoT) method.

Code Implementations: 1 repo
