AI LG · Oct 5, 2025

On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems

Bohan Tang, Huidong Liang, Keyue Jiang, Xiaowen Dong

arXiv:2510.04311v1 · 12.43 citations · h-index: 5

Originality: Incremental advance

AI Analysis

The proposed framework provides a principled foundation for designing and benchmarking LLM-MAS, addressing a gap in systematic evaluation for researchers and practitioners.

The paper addresses how to evaluate LLM-based multi-agent systems by proposing a theoretical framework based on task complexity. It shows that the benefit over single-agent systems increases with both task depth and width, with depth having the more pronounced effect.

Large language model multi-agent systems (LLM-MAS) offer a promising paradigm for harnessing collective intelligence to achieve more advanced forms of AI behaviour. While recent studies suggest that LLM-MAS can outperform LLM single-agent systems (LLM-SAS) on certain tasks, the lack of systematic experimental designs limits the strength and generality of these conclusions. We argue that a principled understanding of task complexity, such as the degree of sequential reasoning required and the breadth of capabilities involved, is essential for assessing the effectiveness of LLM-MAS in task solving. To this end, we propose a theoretical framework characterising tasks along two dimensions: depth, representing reasoning length, and width, representing capability diversity. We theoretically examine a representative class of LLM-MAS, namely the multi-agent debate system, and empirically evaluate its performance in both discriminative and generative tasks with varying depth and width. Theoretical and empirical results show that the benefit of LLM-MAS over LLM-SAS increases with both task depth and width, and the effect is more pronounced with respect to depth. This clarifies when LLM-MAS are beneficial and provides a principled foundation for designing future LLM-MAS methods and benchmarks.
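The depth and width dimensions described in the abstract can be made concrete with a toy calculation. The sketch below is not the paper's model: it assumes a task decomposes into `depth * width` independent units, that a single agent solves each unit with probability `p`, and that the multi-agent system resolves each unit by a strict majority vote among independent agents. All function names are hypothetical.

```python
from math import comb


def majority_success(p: float, n_agents: int) -> float:
    """Probability that a strict majority of n independent agents is correct,
    when each agent is correct with probability p (assumed independence)."""
    need = n_agents // 2 + 1
    return sum(
        comb(n_agents, k) * p**k * (1 - p) ** (n_agents - k)
        for k in range(need, n_agents + 1)
    )


def task_success(unit_p: float, depth: int, width: int) -> float:
    """Toy assumption: the task succeeds only if every one of the
    depth * width units succeeds, and units are independent."""
    return unit_p ** (depth * width)


def debate_gain(p: float, n_agents: int, depth: int, width: int) -> float:
    """Ratio of multi-agent to single-agent success under the toy model."""
    single = task_success(p, depth, width)
    multi = task_success(majority_success(p, n_agents), depth, width)
    return multi / single


if __name__ == "__main__":
    for depth, width in [(1, 1), (2, 2), (4, 2), (2, 4)]:
        gain = debate_gain(p=0.8, n_agents=3, depth=depth, width=width)
        print(f"depth={depth} width={width} gain={gain:.3f}")
```

In this toy model the gain depends only on the product `depth * width`, so it grows with both dimensions but treats them symmetrically. It therefore does not reproduce the abstract's finding that depth has the more pronounced effect, which requires the paper's own analysis.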
