CL, AI | Feb 12

Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems

arXiv:2602.11877v1 | 1 citation | h-index: 4
Originality: Highly original
AI Analysis

This paper addresses the need for fair and comprehensive evaluation of routers in collaborative LLM systems, a known bottleneck, and offers a novel method for it.

The paper tackles the unsystematic evaluation of routers in collaborative LLM systems. It proposes RouterXBench, an evaluation framework with three dimensions (router ability, scenario alignment, and cross-domain robustness), and introduces ProbeDirichlet, a router that uses internal hidden states. ProbeDirichlet achieves relative improvements of 16.68% in router ability and 18.86% in high-accuracy scenarios over the best baselines.

Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
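The abstract describes the router only at a high level, so the following is a minimal sketch of the general idea and not the authors' implementation. It assumes that per-layer hidden states are pooled with weights drawn from a Dirichlet distribution, that a linear probe turns the pooled vector into a probability that the local model can answer, and that a fixed threshold decides between local and cloud. The function names, the threshold, the toy numbers, and the choice to sample weights during training but use the Dirichlet mean at inference are all hypothetical.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_mean(alpha):
    """Expected layer weights under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def aggregate_layers(hidden_states, weights):
    """Weighted sum over layers; hidden_states is [num_layers][hidden_dim]."""
    dim = len(hidden_states[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, hidden_states))
        for i in range(dim)
    ]


def probe_score(hidden_states, alpha, probe_w, probe_b, rng=None):
    """Estimated probability that the local model answers correctly.

    Assumption: sample layer weights when an rng is given (training-style),
    otherwise use the Dirichlet mean (inference-style).
    """
    if rng is not None:
        weights = sample_dirichlet(alpha, rng)
    else:
        weights = dirichlet_mean(alpha)
    pooled = aggregate_layers(hidden_states, weights)
    logit = sum(w * x for w, x in zip(probe_w, pooled)) + probe_b
    return 1.0 / (1.0 + math.exp(-logit))


def route(score, threshold=0.5):
    """Keep the query local if the probe is confident, else offload to the cloud."""
    return "local" if score >= threshold else "cloud"


# Toy example: 3 layers, hidden size 2 (all values are made up).
toy_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_alpha = [1.0, 1.0, 2.0]  # in a real router these would be learned
toy_score = probe_score(toy_hidden, toy_alpha, probe_w=[2.0, -1.0], probe_b=0.0)
print(route(toy_score))
```

In a sketch like this, raising the threshold sends more queries to the cloud model, which is one way a single router could be tuned toward a cost-sensitive or an accuracy-sensitive scenario.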
