Keeping Up with the Models: Online Deployment and Routing of LLMs at Scale
This paper addresses the challenge LLM service providers face in deploying and routing models online efficiently. Its contribution is incremental, since it builds on existing online decision-making and bandit frameworks.
The paper studies how a service provider can manage a stream of large language models (LLMs) under capacity and cost constraints. It introduces StageRoute, an online algorithm that combines deployment and routing decisions, proves a near-optimal regret bound of order T^{2/3}, and reports experimental performance close to the optimum.
The rapid pace at which new large language models (LLMs) appear -- and older ones become obsolete -- forces LLM service providers to juggle a streaming inventory of models while respecting tight deployment capacity and per-query cost budgets. We cast this reality as an online decision problem that couples stage-wise deployment, made at fixed maintenance windows, with per-query routing among the models kept live. We introduce StageRoute, a hierarchical algorithm that (i) optimistically selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, then (ii) solves a budget-constrained bandit sub-problem to route each incoming query. We prove that StageRoute achieves a regret of order $T^{2/3}$ and provide a matching lower bound, thereby establishing its near-optimality. Moreover, our experiments confirm the theory, demonstrating that StageRoute performs close to the optimum in practical settings.
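The two-level structure described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the Hoeffding-style confidence radius, the greedy ranking used for deployment, and the greedy per-query rule standing in for the budget-constrained bandit sub-problem are all simplifying assumptions, as are the names `optimistic_bounds`, `select_deployment`, `route_query`, and the `stats` layout. Rewards and costs are assumed to lie in [0, 1].

```python
import math

def confidence_radius(pulls, t):
    """Hoeffding-style radius (an assumption); unexplored models get infinity."""
    if pulls == 0:
        return math.inf
    return math.sqrt(2.0 * math.log(max(t, 2)) / pulls)

def optimistic_bounds(stats, t):
    """Map each model to (reward UCB, cost LCB), both clipped to [0, 1]."""
    bounds = {}
    for name, s in stats.items():
        rad = confidence_radius(s["pulls"], t)
        mean_reward = s["reward_sum"] / s["pulls"] if s["pulls"] else 0.0
        mean_cost = s["cost_sum"] / s["pulls"] if s["pulls"] else 0.0
        bounds[name] = (min(1.0, mean_reward + rad), max(0.0, mean_cost - rad))
    return bounds

def select_deployment(stats, t, m_max):
    """Stage-level step: keep at most m_max models, ranked optimistically.

    Greedy stand-in: highest reward UCB first, ties broken by lowest cost LCB.
    """
    bounds = optimistic_bounds(stats, t)
    ranked = sorted(bounds, key=lambda m: (-bounds[m][0], bounds[m][1]))
    return ranked[:m_max]

def route_query(deployed, stats, t, budget):
    """Per-query step: pick the deployed model with the best reward UCB
    among those whose cost LCB fits the budget (hypothetical greedy rule)."""
    bounds = optimistic_bounds({m: stats[m] for m in deployed}, t)
    affordable = [m for m in deployed if bounds[m][1] <= budget]
    if not affordable:
        # Nothing looks affordable: fall back to the optimistically cheapest model.
        return min(deployed, key=lambda m: bounds[m][1])
    return max(affordable, key=lambda m: bounds[m][0])

# Illustrative statistics (made-up numbers, not from the paper).
stats = {
    "a": {"pulls": 1000, "reward_sum": 900.0, "cost_sum": 800.0},
    "b": {"pulls": 1000, "reward_sum": 600.0, "cost_sum": 200.0},
    "c": {"pulls": 1000, "reward_sum": 300.0, "cost_sum": 100.0},
}
deployed = select_deployment(stats, t=1000, m_max=2)
choice = route_query(deployed, stats, t=1000, budget=0.3)
```

In this sketch a newly arrived model with no observations receives the most optimistic bounds possible, so it is deployed at the next stage and explored. The abstract's description suggests the actual routing step optimizes over the deployed set under the budget rather than applying a greedy rule per query.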