SEApr 19

Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths

Huashan Chen, Zhenyu Qi, Haotang Li, Hong Chen, Jinfu Chen, Kebin Peng, In Kee Kim, Kyu Hyung Lee, Sen He, Weiyi Shang

arXiv:2510.0137990.7h-index: 5

AI Analysis

For developers and researchers using LLMs for code generation, PerfOrch demonstrates that structured multi-model collaboration yields higher correctness and efficiency than any single model, with generalizable patterns across benchmarks.

PerfOrch, a multi-LLM orchestration system that routes code generation tasks to the most suitable model per stage and category, achieves pass@1 rates of 97.19% on HumanEval-X and 95.83% on EffiBench-X, outperforming the best single-model pipeline by 1.22-14.58 percentage points across languages.

Large Language Models (LLMs) have become central to automated code generation, yet existing approaches operate within a single-LLM paradigm: one model is selected and applied throughout the entire generation process. We observe that different LLMs exhibit complementary strengths: no single model dominates across all programming languages, algorithmic problem categories, or development stages. Multi-LLM collaboration, structured as per-stage, per-category routing rather than majority voting, produces higher-quality code than any individual model. Based on this observation, we propose PerfOrch, a multi-agent orchestration system that decomposes code generation into four collaborative agents: categorization, generation, debugging, and refinement. Each agent maintains a Memory module: a ranking matrix indexed by programming language and problem category, constructed from offline profiling and consulted at runtime to select the most suitable model for each task. We evaluate PerfOrch on two benchmarks, HumanEval-X and EffiBench-X, totaling 2,500 problems across five languages (Python, Java, C++, Go, and Rust). PerfOrch achieves average pass@1 rates of 97.19% on HumanEval-X and 95.83% on EffiBench-X, improving over the strongest single-model pipeline by 1.22-14.58 percentage points across languages. Notably, Memory rankings constructed solely from HumanEval-X profiling generalize to the entirely unseen EffiBench-X benchmark without re-profiling, demonstrating that the complementary-strength patterns PerfOrch exploits are properties of the models rather than artifacts of a specific problem distribution. Beyond correctness, PerfOrch improves execution time for 61-90% of solved problems with mean speedups of 4.7-29.9%, matching the refinement coverage of exhaustive multi-model evaluation at roughly half the token cost.

View on arXiv PDF

Similar