AI · Feb 10

Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge

Wei Yang, Shixuan Li, Heng Ping, Peiyu Zhang, Paul Bogdan, Jesse Thomason

arXiv:2602.09341v1 · 4 citations · h-index: 6

Originality: Highly original

AI Analysis

This work addresses brittle consensus in multi-agent LLM reasoning by offering a more robust aggregation method than majority voting.

The paper tackles output aggregation in multi-agent LLM systems by replacing majority voting with AgentAuditor, a method that searches a reasoning tree to resolve conflicts between agents. It reports up to 5% absolute accuracy improvement over majority vote and up to 3% over LLM-as-Judge across 5 settings.

Multi-agent systems (MAS) can substantially extend the reasoning capacity of large language models (LLMs), yet most frameworks still aggregate agent outputs with majority voting. This heuristic discards the evidential structure of reasoning traces and is brittle under the confabulation consensus, where agents share correlated biases and converge on the same incorrect rationale. We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree that explicitly represents agreements and divergences among agent traces. AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points, turning global adjudication into efficient, localized verification. We further propose Anti-Consensus Preference Optimization (ACPO), which trains the adjudicator on majority-failure cases and rewards evidence-based minority selections over popular errors. AgentAuditor is agnostic to MAS setting, and we find across 5 popular settings that it yields up to 5% absolute accuracy improvement over a majority vote, and up to 3% over using LLM-as-Judge.
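
The contrast between voting and auditing at divergence points can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the Reasoning Tree is built or how branches are compared, so everything here is an assumption made for illustration. In particular, steps are merged by exact string match, and `toy_verify` is a hypothetical arithmetic checker standing in for the trained adjudicator. The names `Node`, `build_tree`, `majority_vote`, and `audit` are likewise invented for this example.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Trace = Tuple[List[str], str]  # (reasoning steps, final answer)


@dataclass
class Node:
    step: str = ""
    children: Dict[str, "Node"] = field(default_factory=dict)
    answers: List[str] = field(default_factory=list)  # answers of traces passing through


def build_tree(traces: List[Trace]) -> Node:
    """Merge traces into a prefix tree; identical steps share a node (assumption)."""
    root = Node()
    for steps, answer in traces:
        node = root
        node.answers.append(answer)
        for step in steps:
            node = node.children.setdefault(step, Node(step))
            node.answers.append(answer)
    return root


def majority_vote(traces: List[Trace]) -> str:
    """Baseline: pick the most frequent final answer, ignoring the reasoning."""
    return Counter(answer for _, answer in traces).most_common(1)[0][0]


def audit(root: Node, verify: Callable[[str], float]) -> str:
    """Walk the tree; only where branches diverge, score each branch's first step."""
    node = root
    while node.children:
        branches = list(node.children.values())
        if len(branches) == 1:
            node = branches[0]  # agreement: nothing to adjudicate
        else:
            node = max(branches, key=lambda b: verify(b.step))  # divergence point
    return Counter(node.answers).most_common(1)[0][0]


def toy_verify(step: str) -> float:
    """Hypothetical stand-in for the adjudicator: checks 'a*b=c' or 'a+b=c'."""
    m = re.fullmatch(r"(\d+)([*+])(\d+)=(\d+)", step)
    if m is None:
        return 0.0
    a, op, b, c = int(m[1]), m[2], int(m[3]), int(m[4])
    return 1.0 if (a * b if op == "*" else a + b) == c else 0.0


# Two agents share the same arithmetic slip; one agent reasons correctly.
traces = [
    (["17*3=41", "41+9=50"], "50"),
    (["17*3=41", "41+9=50"], "50"),
    (["17*3=51", "51+9=60"], "60"),
]

print(majority_vote(traces))                  # the popular but wrong answer
print(audit(build_tree(traces), toy_verify))  # the verified minority answer
```

In this toy case the vote follows the two agents that share an error, while the audit inspects only the single step where the traces split, which is the sense in which adjudication becomes localized. A real system would need semantic matching of steps and a far stronger verifier than the arithmetic check assumed here.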
