Mixture-of-Models: Unifying Heterogeneous Agents via N-Way Self-Evaluating Deliberation
This work addresses the high computational cost of large AI models, offering a hardware-efficient alternative that could benefit researchers and developers with limited resources.
The paper tackles the problem of efficiently combining multiple smaller AI models to match or exceed the performance of much larger models, introducing the N-Way Self-Evaluating Deliberation (NSED) protocol. Results show that ensembles of small (<20B parameter) models can match or exceed state-of-the-art 100B+ parameter models on benchmarks like AIME 2025 and LiveCodeBench, while also reducing sycophancy scores on DarkBench.
This paper introduces the N-Way Self-Evaluating Deliberation (NSED) protocol, a Runtime Mixture-of-Models (MoM) architecture that constructs emergent composite models from a plurality of distinct expert agents. Unlike traditional Mixture-of-Experts (MoE) architectures, which rely on static gating networks, NSED employs a Dynamic Expertise Broker, a runtime optimization engine that treats model selection as a variation of the Knapsack Problem, binding heterogeneous checkpoints to functional roles based on live telemetry and cost constraints. At the execution layer, we formalize deliberation as a Macro-Scale Recurrent Neural Network (RNN), where the consensus state loops back through a semantic forget gate to enable iterative refinement without proportional VRAM scaling. Key components include an orchestration fabric for trustless N-to-N peer review, a Quadratic Voting activation function for non-linear consensus, and a feedback-driven state update. Empirical validation on challenging benchmarks (AIME 2025, LiveCodeBench) demonstrates that this topology allows ensembles of small (fewer than 20B parameters) consumer-grade models to match or exceed the performance of state-of-the-art 100B+ parameter models, establishing a new hardware arbitrage efficiency frontier. Furthermore, testing on the DarkBench safety suite reveals intrinsic alignment properties, with peer-mediated correction reducing sycophancy scores below those of any individual agent.
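To make two of the abstract's ideas concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes the broker can be approximated as a multiple-choice knapsack (one model per role, total cost within a budget) and that Quadratic Voting follows the standard rule in which spending c credits buys sqrt(c) votes. The model names, the scores, the costs, and the functions `bind_roles` and `quadratic_vote` are all hypothetical.

```python
from itertools import product
from math import sqrt

# Hypothetical telemetry: role -> {model name: (quality score, cost)}.
CANDIDATES = {
    "proposer": {"model-a": (0.80, 4.0), "model-b": (0.70, 2.0)},
    "critic": {"model-c": (0.90, 5.0), "model-d": (0.60, 1.0)},
}


def bind_roles(candidates, budget):
    """Pick one model per role, maximizing total quality within the budget.

    Exhaustive search over all role assignments; fine for a handful of
    roles, but a real broker would need a scalable knapsack solver.
    Returns None when no assignment fits the budget.
    """
    roles = list(candidates)
    best, best_quality = None, float("-inf")
    for combo in product(*(candidates[r].items() for r in roles)):
        cost = sum(c for _, (_, c) in combo)
        quality = sum(q for _, (q, _) in combo)
        if cost <= budget and quality > best_quality:
            best = {r: name for r, (name, _) in zip(roles, combo)}
            best_quality = quality
    return best


def quadratic_vote(ballots):
    """Tally ballots of the form agent -> {proposal: credits spent}.

    Each agent's influence on a proposal is sqrt(credits), so doubling
    influence costs four times the credits (assumed standard QV rule).
    """
    tally = {}
    for spend in ballots.values():
        for proposal, credits in spend.items():
            tally[proposal] = tally.get(proposal, 0.0) + sqrt(credits)
    return max(tally, key=tally.get), tally


binding = bind_roles(CANDIDATES, budget=7.0)
winner, tally = quadratic_vote(
    {"agent-x": {"p1": 9}, "agent-y": {"p2": 4}, "agent-z": {"p2": 4}}
)
```

In this toy vote, a linear tally would favor `p1` (9 credits against 8), while the square-root rule favors `p2`, which has broader support. That dampening of a single confident agent is the kind of non-linear consensus the abstract attributes to Quadratic Voting.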