Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

arXiv:2605.0739551.2


AI Analysis

For researchers building multi-LLM routing systems, this work reveals that prior headroom estimates are inflated due to flawed evaluation, highlighting the need for more reliable protocols.

The paper shows that a substantial portion of reported 'unsolvability' in multi-LLM routing stems from evaluation artifacts (judge bias, truncation, format mismatches) rather than genuine model limitations. Correcting for these artifacts lowers measured unsolvability. The same artifacts distort router training signals: standard routers collapse to majority-class prediction, incurring a 13-17 percentage point opportunity cost.
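To make the opportunity-cost idea concrete, here is a minimal sketch of how a router that always picks the smallest tier could be compared against an oracle that picks the cheapest tier able to solve each query. The definition used here (oracle solve rate minus always-smallest solve rate), the function names, the tier names, and the toy data are all assumptions for illustration; the paper's exact definition may differ.

```python
from typing import Dict, List, Optional

def cheapest_solving_tier(solved: Dict[str, bool], tiers: List[str]) -> Optional[str]:
    """Return the first (cheapest) tier that solves the query, or None."""
    for tier in tiers:  # tiers are ordered from cheapest to most expensive
        if solved.get(tier, False):
            return tier
    return None

def majority_collapse_report(results: List[Dict[str, bool]], tiers: List[str]) -> Dict[str, float]:
    """Compare an always-smallest router with an oracle router.

    Assumed definition: opportunity cost is the share of queries the oracle
    solves that the always-smallest router does not.
    """
    n = len(results)
    smallest = tiers[0]
    oracle_solved = sum(1 for r in results if cheapest_solving_tier(r, tiers) is not None)
    smallest_solved = sum(1 for r in results if r.get(smallest, False))
    return {
        "smallest_tier_solves": smallest_solved / n,
        "oracle_solve_rate": oracle_solved / n,
        "opportunity_cost": (oracle_solved - smallest_solved) / n,
    }

# Hypothetical toy data: 7 queries solved by the small tier, 2 solved only by
# the large tier, 1 solved by neither.
TIERS = ["small", "large"]
toy_results = (
    [{"small": True, "large": True}] * 7
    + [{"small": False, "large": True}] * 2
    + [{"small": False, "large": False}]
)
report = majority_collapse_report(toy_results, TIERS)
```

On this toy pool, a router that collapses to the smallest tier looks accurate on most queries yet forgoes every query that only the larger tier can solve, which is the gap the opportunity cost measures.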

Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing each query to the cheapest capable model. Prior work attributes routing headroom to an "unsolvability ceiling": queries that no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework that attributes failures to these artifacts, revealing consistent patterns across domains and model families. These artifacts also distort router training signals: standard routers collapse to majority-class prediction (~79% smallest-tier optimal), as confirmed by random-feature and shuffled-label controls, incurring a 13-17 percentage point opportunity cost. We provide actionable recommendations, including dual-judge validation, exact-match anchoring, and cost-sensitive objectives. Our findings suggest that existing routing headroom estimates are substantially inflated, underscoring the need for reliable evaluation protocols in multi-LLM systems.
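As an illustration of how dual-judge validation and exact-match grounding can change the measured unsolvability rate, the sketch below counts a query as unsolvable only when no model in the pool produces a correct answer. The correctness rule (trust exact match when a reference exists, otherwise require both judges to agree), the `PairResult` record, and the toy data are assumptions made for this example, not the paper's protocol.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

@dataclass(frozen=True)
class PairResult:
    """One query-model pair with two judge verdicts and an optional exact match."""
    query_id: str
    model: str
    judge_a: bool
    judge_b: bool
    exact_match: Optional[bool]  # None when the task has no reference answer

def grounded_correct(r: PairResult) -> bool:
    """Assumed rule: exact match wins when available, else both judges must agree."""
    if r.exact_match is not None:
        return r.exact_match
    return r.judge_a and r.judge_b

def unsolvability_rate(results: Iterable[PairResult],
                       is_correct: Callable[[PairResult], bool]) -> float:
    """Fraction of queries that no model in the pool answers correctly."""
    solved: Dict[str, bool] = {}
    for r in results:
        solved[r.query_id] = solved.get(r.query_id, False) or is_correct(r)
    if not solved:
        return 0.0
    return sum(1 for ok in solved.values() if not ok) / len(solved)

# Hypothetical toy data: on q1, judge A rejects a terse answer that exact match accepts.
toy_pairs = [
    PairResult("q1", "small", judge_a=False, judge_b=True, exact_match=True),
    PairResult("q1", "large", judge_a=False, judge_b=False, exact_match=False),
    PairResult("q2", "small", judge_a=True, judge_b=True, exact_match=True),
    PairResult("q3", "small", judge_a=False, judge_b=False, exact_match=False),
    PairResult("q3", "large", judge_a=False, judge_b=False, exact_match=False),
    PairResult("q4", "small", judge_a=True, judge_b=True, exact_match=None),
]
single_judge_rate = unsolvability_rate(toy_pairs, lambda r: r.judge_a)
grounded_rate = unsolvability_rate(toy_pairs, grounded_correct)
```

In this toy pool the single-judge rate counts q1 as unsolvable even though one answer matches the reference exactly, so the grounded rate comes out lower. Real pipelines would also need to handle truncated generations and format normalization before the exact-match check.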

