MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency
For practitioners deploying MoA systems with limited GPU resources, MOSAIC addresses load imbalances caused by skewed expert demand and variable generation lengths, yielding 1.7–2.3x end-to-end speedups on a 4-GPU system.
MOSAIC accelerates Mixture-of-Agents (MoA) workloads on limited GPUs by jointly optimizing expert placement and prompt assignment with an ILP scheduler, and by using confidence-aware adaptive aggregation to skip the aggregator LLM for consensus queries. On a 4-GPU system, it achieves 1.7–2.3x end-to-end speedups over the baseline scheduler while matching accuracy within 0.1 percentage points.
Mixture-of-Agents (MoA) systems improve reasoning accuracy by routing each query to multiple expert LLMs and aggregating their outputs. Executing this workload efficiently on limited GPU resources is difficult for two reasons: skill-based routing creates skewed expert demand, and combining instruction-tuned LLMs with long-reasoning models results in extreme variability in generation lengths. Consequently, traditional scheduling strategies suffer from significant GPU idling and throughput collapse due to load imbalances. We present MOSAIC, a scheduling framework to accelerate MoA workloads. First, we formulate an Integer Linear Program (ILP) based scheduler that jointly optimizes expert placement and per-worker prompt assignment from offline-profiled costs, replicating reasoning experts across workers while pinning lightweight ones. Second, MOSAIC uses confidence-aware adaptive aggregation, leveraging inter-expert agreement to bypass the heavy final aggregator LLM for consensus queries. In our 4-GPU system, MOSAIC achieves up to 2.5x expert-stage, 4.23x aggregator-stage, and 1.7–2.3x end-to-end speedups over the baseline scheduler, while matching accuracy within 0.1 percentage points.
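To make the idea of confidence-aware adaptive aggregation concrete, the following is a minimal sketch rather than MOSAIC's actual implementation. It assumes that inter-expert agreement is measured as the fraction of experts whose normalized answers match exactly, and that the aggregator is bypassed once that fraction reaches a threshold. The function name `answer_with_adaptive_aggregation`, the `threshold` parameter and its default, and the exact-match normalization are all hypothetical choices made for illustration; the abstract does not specify how confidence is computed.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def answer_with_adaptive_aggregation(
    expert_answers: Sequence[str],
    call_aggregator: Callable[[Sequence[str]], str],
    threshold: float = 0.75,  # assumed value, not taken from the paper
) -> Tuple[str, bool]:
    """Return (final_answer, used_aggregator).

    Hypothetical sketch: agreement is the share of experts that gave the
    most common answer after simple normalization.
    """
    if not expert_answers:
        raise ValueError("at least one expert answer is required")

    # Normalize so that trivial formatting differences do not break consensus.
    normalized = [a.strip().lower() for a in expert_answers]
    top_answer, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(normalized)

    if agreement >= threshold:
        # Consensus query: skip the heavy aggregator LLM entirely.
        return top_answer, False

    # Experts disagree: fall back to the aggregator over the raw outputs.
    return call_aggregator(expert_answers), True


if __name__ == "__main__":
    # Stand-in for an aggregator LLM call (assumption for the demo).
    fake_aggregator = lambda answers: "aggregated"
    print(answer_with_adaptive_aggregation(["42", "42 ", "42", "41"], fake_aggregator))
    print(answer_with_adaptive_aggregation(["a", "b", "c"], fake_aggregator))
```

In this sketch the aggregator cost is paid only for queries where experts disagree, which is the mechanism the abstract credits for the aggregator-stage speedup. A real system would likely need a more robust agreement measure for free-form outputs than exact string matching.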