CL LG · Feb 12, 2025

Stop Overvaluing Multi-Agent Debate -- We Must Rethink Evaluation and Embrace Model Heterogeneity

Hangfan Zhang, Zhiyao Cui, Jianhao Chen, Xinrun Wang, Qiaosheng Zhang, Zhen Wang, Dinghao Wu, Shuyue Hu

arXiv:2502.08788v3 · 24.530 citations · h-index: 10

Originality: Incremental advance

AI Analysis

This research addresses the overvaluing of multi-agent debate, a problem that concerns large language model developers and researchers. It highlights the need to reevaluate current evaluation practices and the importance of model heterogeneity.

The authors found that multi-agent debate (MAD) methods often fail to outperform simple single-agent baselines, despite consuming more computation, with results showing no significant improvement across 9 benchmarks. They discovered that model heterogeneity can consistently improve current MAD frameworks.

Multi-agent debate (MAD) has gained significant attention as a promising line of research to improve the factual accuracy and reasoning capabilities of large language models (LLMs). Despite its conceptual appeal, current MAD research suffers from critical limitations in evaluation practices, including limited benchmark coverage, weak baseline comparisons, and inconsistent setups. This paper presents a systematic evaluation of 5 representative MAD methods across 9 benchmarks using 4 foundational models. Surprisingly, our findings reveal that MAD often fails to outperform simple single-agent baselines such as Chain-of-Thought and Self-Consistency, even when consuming significantly more inference-time computation. To advance MAD research, we further explore the role of model heterogeneity and find that it acts as a universal antidote, consistently improving current MAD frameworks. Based on our findings, we argue that the field must stop overvaluing MAD in its current form; for true advancement, we must critically rethink evaluation paradigms and actively embrace model heterogeneity as a core design principle.
