DB | AI | LG | Sep 2, 2025

Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving

arXiv:2509.02718v2 | 7 citations | h-index: 12 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses cost efficiency for LLM service providers by enabling efficient online routing without training. The advance is incremental, since it builds on existing routing concepts.

The paper tackles the problem of high deployment and computation costs for LLM services by introducing a training-free online routing algorithm that directs queries to optimal LLMs, achieving an average 3.55x improvement in overall performance and 1.85x in cost efficiency.

Increasing demand for Large Language Model (LLM) services imposes substantial deployment and computation costs on providers. LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features. However, existing works primarily focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets. In this work, we introduce the first training-free algorithm for online routing scenarios. Our algorithm leverages approximate nearest neighbor search to efficiently estimate query features and performs a one-time optimization over a small set of initial queries to learn a routing strategy that guides future routing. We provide theoretical guarantees demonstrating that our algorithm achieves a competitive ratio of $1 - o(1)$ under natural assumptions. This is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of 3.55$\times$ in overall performance, 1.85$\times$ in cost efficiency, and nearly 4.25$\times$ in throughput. Our code is available at https://github.com/fzwark/PORT.
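The two ingredients named in the abstract (nearest-neighbor estimation of query features, then a one-time optimization on a small initial sample) can be illustrated with a minimal sketch. This is not the paper's algorithm or the code in the linked repository. It assumes a single global cost budget and a single price found by bisection, uses brute-force search as a stand-in for an approximate nearest neighbor index, and all names (`knn_estimate`, `route`, `fit_price`) and the toy data are hypothetical.

```python
import math

def knn_estimate(query, history, k=3):
    """Estimate per-model (score, cost) for `query` by averaging its k nearest
    historical records. Brute-force search stands in for an ANN index."""
    nearest = sorted(history, key=lambda rec: math.dist(query, rec["emb"]))[:k]
    n_models = len(nearest[0]["score"])
    score = [sum(r["score"][j] for r in nearest) / len(nearest) for j in range(n_models)]
    cost = [sum(r["cost"][j] for r in nearest) / len(nearest) for j in range(n_models)]
    return score, cost

def route(score, cost, price):
    """Pick the model maximising estimated score minus price-weighted cost."""
    return max(range(len(score)), key=lambda j: score[j] - price * cost[j])

def fit_price(estimates, budget_per_query, iters=50, hi=1e3):
    """One-time step (assumed form): bisect for a small price whose routing keeps
    the average estimated cost of the sample within the per-query budget."""
    def avg_cost(price):
        return sum(c[route(s, c, price)] for s, c in estimates) / len(estimates)
    if avg_cost(0.0) <= budget_per_query:
        return 0.0  # budget is not binding, no penalty needed
    if avg_cost(hi) > budget_per_query:
        raise ValueError("budget is infeasible even with the cheapest routing")
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if avg_cost(mid) <= budget_per_query:
            hi = mid  # feasible: try a smaller price
        else:
            lo = mid  # over budget: raise the price
    return hi

# Toy history: model 0 is cheap, model 1 is expensive.
# "Easy" queries cluster near (0, 0); "hard" queries cluster near (1, 1).
history = []
for i in range(10):
    d = i * 0.01
    history.append({"emb": (0.0 + d, 0.0), "score": [0.9, 0.95], "cost": [1.0, 5.0]})
    history.append({"emb": (1.0 - d, 1.0), "score": [0.2, 0.9], "cost": [1.0, 5.0]})

# One-time optimization over a small set of initial queries.
initial_queries = [(0.02, 0.01), (0.98, 0.99), (0.05, 0.0), (0.9, 1.0)]
estimates = [knn_estimate(q, history) for q in initial_queries]
price = fit_price(estimates, budget_per_query=3.5)

# Later queries are routed with the fixed price; nothing is retrained.
easy_choice = route(*knn_estimate((0.03, 0.0), history), price)
hard_choice = route(*knn_estimate((0.97, 1.0), history), price)
print(price, easy_choice, hard_choice)
```

In this toy setup the learned price sends easy queries to the cheap model and reserves the expensive model for hard queries. In a high-volume deployment, the `sorted` call in `knn_estimate` would be replaced by a query against an approximate nearest neighbor index.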

