LG · Jun 21, 2025

Online Multi-LLM Selection via Contextual Bandits under Unstructured Context Evolution

Manhin Poon, XiangXiang Dai, Xutong Liu, Fang Kong, John C. S. Lui, Jinhang Zuo

arXiv:2506.17670v1 · 15 citations · h-index: 12

Originality: Incremental advance

AI Analysis

This paper addresses efficient, adaptive LLM selection for users in real-time applications. The advance is incremental because it builds on existing contextual bandit methods.

The paper tackles the problem of adaptively selecting the best large language model (LLM) for user queries in an online setting with dynamically changing prompts, proposing a contextual bandit framework that achieves sublinear regret and outperforms existing routing strategies in accuracy and cost-efficiency.

Large language models (LLMs) exhibit diverse response behaviors, costs, and strengths, making it challenging to select the most suitable LLM for a given user query. We study the problem of adaptive multi-LLM selection in an online setting, where the learner interacts with users through multi-step query refinement and must choose LLMs sequentially without access to offline datasets or model internals. A key challenge arises from unstructured context evolution: the prompt dynamically changes in response to previous model outputs via a black-box process, which cannot be simulated, modeled, or learned. To address this, we propose the first contextual bandit framework for sequential LLM selection under unstructured prompt dynamics. We formalize a notion of myopic regret and develop a LinUCB-based algorithm that provably achieves sublinear regret without relying on future context prediction. We further introduce budget-aware and positionally-aware (favoring early-stage satisfaction) extensions to accommodate variable query costs and user preferences for early high-quality responses. Our algorithms are theoretically grounded and require no offline fine-tuning or dataset-specific training. Experiments on diverse benchmarks demonstrate that our methods outperform existing LLM routing strategies in both accuracy and cost-efficiency, validating the power of contextual bandits for real-time, adaptive LLM selection.
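To make the LinUCB idea concrete, the sketch below implements the standard disjoint LinUCB rule with one arm per candidate LLM. It is a minimal illustration, not the paper's algorithm: the budget-aware and positionally-aware extensions are omitted, and the names `LinUCBArm`, `select_llm`, `simulate`, and `HIDDEN_THETA`, the two-dimensional one-hot context, and the noise-free rewards are all assumptions made for the example. At each step the learner scores arms using only the currently observed context, which reflects the myopic setting in which future prompts are never predicted.

```python
import math


class LinUCBArm:
    """Ridge-regression state for one arm (one candidate LLM) in disjoint LinUCB."""

    def __init__(self, dim, reg=1.0):
        # A_inv is the inverse of (reg * I + sum of x x^T); b is the sum of reward * x.
        self.A_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _A_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def ucb(self, x, alpha):
        """Estimated reward plus an exploration bonus for context x."""
        theta = self._A_inv_times(self.b)  # ridge estimate A^{-1} b
        mean = sum(t * xi for t, xi in zip(theta, x))
        A_inv_x = self._A_inv_times(x)
        width = math.sqrt(max(sum(xi * v for xi, v in zip(x, A_inv_x)), 0.0))
        return mean + alpha * width

    def update(self, x, reward):
        """Sherman-Morrison rank-one update of A^{-1} after observing (x, reward)."""
        A_inv_x = self._A_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, A_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= A_inv_x[i] * A_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def select_llm(arms, x, alpha=1.0):
    """Return the index of the arm with the highest upper confidence bound."""
    scores = [arm.ucb(x, alpha) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


# Hypothetical ground truth: LLM 0 is better on query type 0, LLM 1 on query type 1.
HIDDEN_THETA = [[0.9, 0.2], [0.2, 0.9]]


def simulate(rounds=500, alpha=1.0):
    """Toy loop: observe a context, pick an LLM, see a noise-free reward, update."""
    arms = [LinUCBArm(dim=2) for _ in HIDDEN_THETA]
    choices = []
    for t in range(rounds):
        # One-hot context for the query type; in the paper's setting this would be
        # a feature vector of the current (black-box evolved) prompt.
        x = [1.0, 0.0] if t % 2 == 0 else [0.0, 1.0]
        k = select_llm(arms, x, alpha)
        reward = sum(th * xi for th, xi in zip(HIDDEN_THETA[k], x))
        arms[k].update(x, reward)
        choices.append(k)
    return choices, arms


if __name__ == "__main__":
    choices, arms = simulate()
    mistakes = sum(1 for t, k in enumerate(choices) if k != t % 2)
    print("suboptimal picks:", mistakes, "of", len(choices))
```

In this toy setting the number of suboptimal picks stays small and does not grow with the horizon, which is the qualitative behavior a sublinear-regret guarantee describes. A real deployment would replace the one-hot context with prompt features and the noise-free reward with observed user feedback.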

