LG AIMay 14

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

Pei Yang, Wanyi Chen, Tongyun Yang, Pengbin Feng, Jiarong Xing, Wentao Guo, Yuhang Yao, Yuhang Han, Hanchen Li, Xu Wang, Zeyu Wang, Jie Xiao

arXiv:2605.1885980.7Has Code

Predicted impact top 15% in LG · last 90 daysOriginality Incremental advance

AI Analysis

For researchers and practitioners building cost-efficient LLM routing systems for long-horizon agentic tasks, this benchmark provides the first step-level evaluation with execution-verified targets and no online LLM judges.

TwinRouterBench introduces a step-level routing benchmark for LLM agents, with a static track (970 prefixes, deterministic scoring) and a dynamic track (SWE-bench harness, live API costs). It enables fast offline iteration and end-to-end validation, addressing the lack of realistic multi-step routing evaluation.

LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.

View on arXiv PDF Code

Similar