AI · Jan 13

M3-BENCH: Process-Aware Evaluation of LLM Agents Social Behaviors in Mixed-Motive Games

Sixiong Xie, Zhuofan Shi, Haiyang Shen, Gang Huang, Yun Ma, Xiang Jing

arXiv:2601.08462v1 · 7.54 citations · h-index: 2

Originality: Incremental advance

AI Analysis

The work addresses a need among researchers and developers for better evaluation of social behaviors in AI agents. The advance is incremental: it builds on existing benchmarks by adding process analysis.

The paper tackles the lack of systematic evaluation of LLM agents' social behaviors, such as cooperation and deception. It proposes M3-Bench, a multi-stage benchmark with a process-aware framework that analyzes behavioral trajectories, reasoning, and communication. The results reveal that some models show inconsistencies between their behavioral outcomes and their reasoning or communication.

As the capabilities of large language model (LLM) agents continue to advance, their advanced social behaviors, such as cooperation, deception, and collusion, call for systematic evaluation. However, existing benchmarks often emphasize a single capability dimension or rely solely on behavioral outcomes, overlooking rich process information from agents' decision reasoning and communicative interactions. To address this gap, we propose M3-Bench, a multi-stage benchmark for mixed-motive games, together with a process-aware evaluation framework that conducts synergistic analysis across three modules: BTA (Behavioral Trajectory Analysis), RPA (Reasoning Process Analysis), and CCA (Communication Content Analysis). Furthermore, we integrate the Big Five personality model and Social Exchange Theory to aggregate multi-dimensional evidence into interpretable social behavior portraits, thereby characterizing agents' personality traits and capability profiles beyond simple task scores or outcome-based metrics. Experimental results show that M3-Bench can reliably distinguish diverse social behavior competencies across models, and it reveals that some models achieve seemingly reasonable behavioral outcomes while exhibiting pronounced inconsistencies in their reasoning and communication.
