AI · Sep 28, 2025

Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on Benchmarks

arXiv:2509.23537v24 citations · h-index: 20
Originality: Incremental advance
AI Analysis

This work aims to improve LLM performance on benchmarks, a problem of interest to AI researchers. Its contribution is incremental because it builds on existing multi-agent and orchestration concepts.

The study compares multi-turn multi-agent orchestration with single-LLM baselines on benchmarks. Orchestration matches or exceeds the strongest single model and consistently outperforms the others, and an analysis of best-achievable orchestration performance suggests room for further gains.

We study multi-turn multi-agent orchestration, where multiple large language model (LLM) agents interact over multiple turns by iteratively proposing answers or casting votes until reaching consensus. Using four LLMs (Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4) on GPQA-Diamond, IFEval, and MuSR, we conduct two experiments: (i) benchmarking orchestration against single-LLM baselines; and (ii) ablations on GPQA-Diamond that vary whether agents see who authored answers and whether they can observe ongoing votes. Orchestration matches or exceeds the strongest single model and consistently outperforms the others. Analysis of best-achievable orchestration performance shows potential for further gains. The ablations show that revealing authorship increases self-voting and ties, and that showing ongoing votes amplifies herding, which speeds convergence but can sometimes yield premature consensus.
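The propose-or-vote loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the names `Propose`, `Vote`, and `orchestrate` are hypothetical, agents are stand-in callables rather than real LLM calls, and two rules are assumptions made only for illustration (consensus means every agent votes for the same answer, and a new proposal discards earlier votes). The authorship and vote-visibility conditions from the ablations are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass(frozen=True)
class Propose:
    """Action: put a new candidate answer on the table."""
    answer: str


@dataclass(frozen=True)
class Vote:
    """Action: vote for an existing candidate by its index."""
    answer_id: int


Action = Union[Propose, Vote]
# Hypothetical agent interface: (question, current candidate answers) -> action.
Agent = Callable[[str, List[str]], Action]


def orchestrate(question: str, agents: Dict[str, Agent],
                max_turns: int = 10) -> Optional[str]:
    """Run turns until all agents vote for the same answer, or give up."""
    answers: List[str] = []
    votes: Dict[str, int] = {}  # agent name -> index of the answer it backs
    for _ in range(max_turns):
        for name, agent in agents.items():
            action = agent(question, list(answers))
            if isinstance(action, Propose):
                answers.append(action.answer)
                # Assumption: a new candidate invalidates earlier votes.
                votes.clear()
            elif 0 <= action.answer_id < len(answers):
                votes[name] = action.answer_id
            # Votes for a nonexistent candidate are silently ignored.
        # Assumption: consensus is unanimity across all agents.
        if len(votes) == len(agents) and len(set(votes.values())) == 1:
            return answers[next(iter(votes.values()))]
    return None  # no consensus within the turn budget
```

In this sketch, an agent that proposes when no candidates exist and otherwise votes for the first candidate reaches consensus in the second turn, because the first agent's proposal in turn one leaves it without a recorded vote until it acts again. A majority rule or a tie-breaking step could be swapped in at the consensus check.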

