MA AI CL CY | Mar 3, 2025

MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You

arXiv:2503.01935v129.6136 citations | h-index: 44 | Has Code | ACL

Originality: Incremental advance

AI Analysis

MultiAgentBench gives researchers and developers working on multi-agent LLM systems a new benchmark, though the advance is incremental because it builds on existing single-agent and domain-specific benchmarks.

The authors address the lack of comprehensive benchmarks for evaluating multi-agent coordination and competition in LLM-based systems by introducing MultiAgentBench, which measures both task completion and collaboration quality across diverse scenarios. Results show that gpt-4o-mini achieved the highest average task score, that the graph structure performed best among coordination protocols in the research scenario, and that cognitive planning improved milestone achievement rates by 3%.

Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini reaches the highest average task score, graph structure performs the best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
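
The four coordination topologies named in the abstract can be pictured as communication graphs over agents. The following is a minimal sketch and is not taken from the MARBLE codebase: the function names, the choice of agent 0 as hub or root, the binary branching factor, and the reading of "graph" as a fully connected mesh are all assumptions made for illustration.

```python
from typing import Set, Tuple

Edge = Tuple[int, int]  # undirected link between two agent indices


def star(n: int) -> Set[Edge]:
    """Assumed star: agent 0 is a hub linked to every other agent."""
    return {(0, i) for i in range(1, n)}


def chain(n: int) -> Set[Edge]:
    """Assumed chain: each agent is linked only to the next one in line."""
    return {(i, i + 1) for i in range(n - 1)}


def tree(n: int, branching: int = 2) -> Set[Edge]:
    """Assumed tree: agent 0 is the root; agent i reports to (i - 1) // branching."""
    return {((i - 1) // branching, i) for i in range(1, n)}


def graph(n: int) -> Set[Edge]:
    """Assumed graph: every pair of agents is linked (fully connected mesh)."""
    return {(i, j) for i in range(n) for j in range(i + 1, n)}


def neighbors(edges: Set[Edge], agent: int) -> Set[int]:
    """Agents that `agent` can exchange messages with directly."""
    return {b if a == agent else a for a, b in edges if agent in (a, b)}


if __name__ == "__main__":
    for name, build in [("star", star), ("chain", chain), ("tree", tree), ("graph", graph)]:
        edges = build(5)
        print(name, sorted(edges), "agent 1 talks to", sorted(neighbors(edges, 1)))
```

Under these assumptions, the star, chain, and tree layouts each use n - 1 links for n agents, while the mesh uses n(n - 1)/2, so the topologies differ mainly in how many agents can talk to each other directly.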

