MA · AI · Mar 20

When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines

arXiv:2603.20324 · 61.2

Predicted impact: top 45% in MA · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses a key design problem for practitioners building multi-agent LLM systems, showing that selector quality can be more impactful than generator diversity, though it is incremental in refining existing pipeline approaches.

The paper tackles the contradictory evidence on whether team diversity improves output quality in multi-agent LLM pipelines by identifying a selection bottleneck and deriving a crossover threshold that determines when diversity helps or hurts. In experiments across 42 tasks, a diverse team with judge-based selection achieved a win rate of 0.810 against a single-model baseline, while a homogeneous team scored 0.512, and judge-based selection outperformed synthesis-based aggregation by a win rate difference of +0.631.

Multi-agent LLM pipelines produce contradictory evidence on whether team diversity improves output quality: heterogeneous Mixture-of-Agents teams outperform single models, yet homogeneous Self-MoA teams consistently win under synthesis-based aggregation. We propose a resolution by identifying the selection bottleneck: a crossover threshold in aggregation quality that determines whether diversity helps or hurts. Under this model, we obtain a closed-form crossover threshold $s^*$ (Proposition 1) that separates the regimes where diversity helps and hurts. In a targeted experiment spanning 42 tasks across 7 categories ($N=210$), a diverse team with judge-based selection achieves a win rate of 0.810 against a single-model baseline, while a homogeneous team scores 0.512, near chance (Glass's $\Delta = 2.07$). Judge-based selection outperforms MoA-style synthesis by $\Delta_{\mathrm{WR}} = +0.631$; the synthesis approach is preferred over the baseline in zero of 42 tasks by the judge panel. A decoupled evaluation with independent judges confirms all directional findings (Spearman $\rho = 0.90$). Exploratory evidence suggests that including a weaker model improves performance while reducing cost ($p < 10^{-4}$, not pre-registered). Our results suggest that selector quality may be a more impactful design lever than generator diversity in single-round generate-then-select pipelines.
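The two summary statistics quoted in the abstract, win rate against a baseline and Glass's $\Delta$, can be illustrated with a minimal sketch. It is not the paper's evaluation code. The function names, the convention that a tie counts as half a win, the choice of the homogeneous team as the control group, and the sample data are all assumptions made for illustration. The only standard fact used is the definition of Glass's $\Delta$: the difference in group means divided by the control group's sample standard deviation.

```python
from statistics import mean, stdev

# Hypothetical scoring convention (an assumption, not taken from the paper):
# a tie against the baseline counts as half a win.
SCORE = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rate(outcomes):
    """Average score over pairwise comparisons against the baseline."""
    return mean(SCORE[o] for o in outcomes)


def glass_delta(treatment, control):
    """Glass's Delta: mean difference scaled by the control group's sample SD."""
    return (mean(treatment) - mean(control)) / stdev(control)


# Made-up per-task scores for illustration only; these are not the paper's data.
diverse = [1.0, 1.0, 0.5, 1.0, 0.0, 1.0]
homogeneous = [0.5, 0.0, 1.0, 0.5, 0.5, 0.5]

print(win_rate(["win", "win", "tie", "loss"]))  # 0.625
print(glass_delta(diverse, homogeneous))
```

Glass's $\Delta$ scales by the control group's spread alone rather than a pooled standard deviation, so it stays interpretable when the two teams differ in variance.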
