CL · Aug 28, 2024

BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

arXiv:2408.15971v1 · 19 citations · h-index: 13 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses AI researchers' need for better evaluation of cooperation and competition in multi-agent systems, though it is incremental because it builds on existing benchmarks.

The authors address two gaps in benchmarks for language models in multi-agent systems: the lack of fine-grained evaluation and the omission of competitive scenarios. They propose BattleAgentBench, which includes seven sub-stages across three difficulty levels, and find that API-based models excel on simple tasks but show significant room for improvement on difficult collaborative and competitive tasks.

Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems place higher demands on the collaboration capabilities of language models. Many benchmarks have been proposed to evaluate these collaborative abilities. However, these benchmarks lack fine-grained evaluations of LLM collaborative capabilities. Additionally, multi-agent collaborative and competitive scenarios are ignored in existing work. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages across three difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition capabilities. We conducted extensive evaluations on four leading closed-source models and seven open-source models. Experimental results indicate that API-based models perform excellently on simple tasks, whereas small open-source models struggle even with simple tasks. On difficult tasks that require collaborative and competitive abilities, API-based models demonstrate some collaborative capabilities, but there is still enormous room for improvement.

