LG | AI | Jun 8, 2025

Keeping Up with the Models: Online Deployment and Routing of LLMs at Scale

arXiv:2506.17254v1 | 2 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge LLM service providers face in deploying and routing models online efficiently. The advance is incremental, since it builds on existing online decision-making and bandit frameworks.

The paper tackles the problem of managing a stream of large language models (LLMs) under deployment capacity and per-query cost constraints. It introduces StageRoute, an online algorithm that couples deployment and routing decisions, proves a regret bound of order $T^{2/3}$ with a matching lower bound, and reports performance close to the optimum in experiments.

The rapid pace at which new large language models (LLMs) appear -- and older ones become obsolete -- forces LLM service providers to juggle a streaming inventory of models while respecting tight deployment capacity and per-query cost budgets. We cast this reality as an online decision problem that couples stage-wise deployment, made at fixed maintenance windows, with per-query routing among the models kept live. We introduce StageRoute, a hierarchical algorithm that (i) optimistically selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, then (ii) solves a budget-constrained bandit sub-problem to route each incoming query. We prove that StageRoute achieves a regret of order $T^{2/3}$ and provide a matching lower bound, thereby establishing its near-optimality. Moreover, our experiments confirm the theory, demonstrating that StageRoute performs close to the optimum in practical settings.
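
The two-level structure described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `ModelStats` container, the Hoeffding-style confidence radius, the greedy "optimistic reward per unit cost" ranking, and the single-query budget check are all assumptions chosen for brevity. They stand in for the optimistic deployment step and the budget-constrained routing sub-problem, whose exact formulations the abstract does not specify. Rewards and costs are assumed to lie in [0, 1].

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ModelStats:
    """Running statistics for one candidate model (hypothetical container)."""
    pulls: int = 0
    reward_sum: float = 0.0
    cost_sum: float = 0.0

    def update(self, reward: float, cost: float) -> None:
        self.pulls += 1
        self.reward_sum += reward
        self.cost_sum += cost


def confidence_radius(pulls: int, t: int) -> float:
    """Hoeffding-style radius; an assumed form, not the paper's constants."""
    if pulls == 0:
        return float("inf")
    return math.sqrt(2.0 * math.log(max(t, 2)) / pulls)


def reward_ucb(s: ModelStats, t: int) -> float:
    """Upper-confidence bound on mean reward, clipped to [0, 1]."""
    if s.pulls == 0:
        return 1.0  # never-tried models look as good as possible
    return min(1.0, s.reward_sum / s.pulls + confidence_radius(s.pulls, t))


def cost_lcb(s: ModelStats, t: int, floor: float = 1e-3) -> float:
    """Lower-confidence bound on mean cost, kept above a small floor."""
    if s.pulls == 0:
        return floor  # never-tried models look as cheap as possible
    return max(floor, s.cost_sum / s.pulls - confidence_radius(s.pulls, t))


def select_deployment(stats: Dict[str, ModelStats], t: int, m_max: int) -> List[str]:
    """Stage-level step: keep at most m_max models.

    Assumption: rank greedily by optimistic reward per unit optimistic cost.
    """
    ranked = sorted(
        stats,
        key=lambda m: reward_ucb(stats[m], t) / cost_lcb(stats[m], t),
        reverse=True,
    )
    return ranked[:m_max]


def route_query(deployed: List[str], stats: Dict[str, ModelStats],
                t: int, budget: float) -> str:
    """Query-level step: pick the deployed model with the highest reward UCB
    among those whose cost LCB fits the per-query budget. If none fits,
    fall back to the optimistically cheapest deployed model."""
    affordable = [m for m in deployed if cost_lcb(stats[m], t) <= budget]
    if not affordable:
        return min(deployed, key=lambda m: cost_lcb(stats[m], t))
    return max(affordable, key=lambda m: reward_ucb(stats[m], t))
```

In this sketch a provider would call `select_deployment` once per maintenance window and `route_query` once per incoming query, feeding observed rewards and costs back through `ModelStats.update`. Using the upper bound for reward and the lower bound for cost makes under-explored models look attractive on both axes, which is what drives exploration of newly arrived models.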
