LG · AI · May 14

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

arXiv:2605.18859 · 80.7 · Has Code
Predicted impact: top 15% in LG · last 90 days · Originality: Incremental advance
AI Analysis

For researchers and practitioners building cost-efficient LLM routing systems for long-horizon agentic tasks, this benchmark provides the first step-level evaluation with execution-verified targets and no online LLM judges.

TwinRouterBench introduces a step-level routing benchmark for LLM agents, with a static track (970 prefixes, deterministic scoring) and a dynamic track (SWE-bench harness, live API costs). It enables fast offline iteration and end-to-end validation, addressing the lack of realistic multi-step routing evaluation.

LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.
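The static-track idea of scoring by plain arithmetic over tier labels and token costs can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the benchmark's released scorer: the `Step` record, the integer tier encoding (0 is cheapest), the `PRICE_PER_TOKEN` table, and the `score_router` function are hypothetical names, and the trajectory-membership term mentioned in the abstract is left out because its definition is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One router-visible prefix (hypothetical record layout)."""
    target_tier: int  # cheapest tier assumed verified as sufficient; 0 = cheapest
    tokens: int       # tokens billed at this step


# Hypothetical price per token for each tier, in arbitrary integer cost units.
PRICE_PER_TOKEN = {0: 1, 1: 5, 2: 25}


def score_router(steps, chosen_tiers):
    """Score tier choices with deterministic arithmetic only, no LLM judge.

    A choice counts as sufficient when the chosen tier is at least the
    target tier. Cost is tokens times the price of the chosen tier.
    """
    if len(steps) != len(chosen_tiers):
        raise ValueError("need exactly one chosen tier per step")
    sufficient = sum(c >= s.target_tier for s, c in zip(steps, chosen_tiers))
    cost = sum(s.tokens * PRICE_PER_TOKEN[c] for s, c in zip(steps, chosen_tiers))
    # Reference cost if every step used exactly its target tier.
    oracle_cost = sum(s.tokens * PRICE_PER_TOKEN[s.target_tier] for s in steps)
    return {
        "sufficient_steps": sufficient,
        "total_steps": len(steps),
        "cost": cost,
        "oracle_cost": oracle_cost,
    }


steps = [Step(target_tier=0, tokens=100),
         Step(target_tier=2, tokens=200),
         Step(target_tier=1, tokens=50)]
result = score_router(steps, chosen_tiers=[0, 1, 1])
print(result)
```

In this toy run the router under-provisions the second step (tier 1 where tier 2 was the target), so it is cheaper than the oracle but only two of three choices count as sufficient. A real scorer would presumably combine such terms into one metric; how TwinRouterBench does so is not stated in the abstract.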

