LG, AI | Feb 3, 2025

Large Language Model-Enhanced Multi-Armed Bandits

Jiahang Sun, Zhiyong Wang, Runhan Yang, Chenjun Xiao, John C. S. Lui, Zhongxiang Dai

arXiv:2502.01118v114.45 citations | h-index: 7

Originality: Incremental advance

AI Analysis

This work targets sequential decision-making in MAB, offering researchers and practitioners an incremental improvement over existing LLM-based methods.

Using large language models (LLMs) for direct arm selection in multi-armed bandits (MAB) has been shown to perform suboptimally. The paper instead proposes a hybrid approach that uses the LLM as a reward predictor inside classical MAB algorithms such as Thompson sampling, and reports consistently better results than direct-arm-selection baselines on synthetic tasks and on tasks built from real-world text datasets.

Large language models (LLMs) have been adopted to solve sequential decision-making tasks such as multi-armed bandits (MAB), in which an LLM is directly instructed to select the arms to pull in every iteration. However, this paradigm of direct arm selection using LLMs has been shown to be suboptimal in many MAB tasks. Therefore, we propose an alternative approach which combines the strengths of classical MAB and LLMs. Specifically, we adopt a classical MAB algorithm as the high-level framework and leverage the strong in-context learning capability of LLMs to perform the sub-task of reward prediction. Firstly, we incorporate the LLM-based reward predictor into the classical Thompson sampling (TS) algorithm and adopt a decaying schedule for the LLM temperature to ensure a transition from exploration to exploitation. Next, we incorporate the LLM-based reward predictor (with a temperature of 0) into a regression oracle-based MAB algorithm equipped with an explicit exploration mechanism. We also extend our TS-based algorithm to dueling bandits where only the preference feedback between pairs of arms is available, which requires non-trivial algorithmic modifications. We conduct empirical evaluations using both synthetic MAB tasks and experiments designed using real-world text datasets, in which the results show that our algorithms consistently outperform previous baseline methods based on direct arm selection. Interestingly, we also demonstrate that in challenging tasks where the arms lack semantic meanings that can be exploited by the LLM, our approach achieves considerably better performance than LLM-based direct arm selection.
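The Thompson-sampling-style loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives neither the prompt, the exact temperature schedule, nor the sampling procedure. Here `stub_reward_predictor` is a hypothetical stand-in for the LLM call (it returns an arm's empirical mean plus noise scaled by the temperature), `temperature_schedule` uses an assumed geometric decay, and the rewards are simulated Bernoulli draws.

```python
import random
from typing import Callable, List, Tuple

History = List[Tuple[int, float]]  # (arm index, observed reward)


def temperature_schedule(t: int, t0: float = 1.0, decay: float = 0.9,
                         floor: float = 0.0) -> float:
    """Assumed geometric decay; the paper's actual schedule may differ."""
    return max(floor, t0 * decay ** t)


def stub_reward_predictor(arm: int, history: History, temperature: float,
                          rng: random.Random) -> float:
    """Hypothetical stand-in for an LLM reward predictor.

    A real predictor would place `history` in the prompt and sample the
    LLM at the given temperature. This stub mimics that randomness by
    adding Gaussian noise scaled by the temperature to the arm's
    empirical mean (0.5 if the arm has not been pulled yet).
    """
    rewards = [r for a, r in history if a == arm]
    mean = sum(rewards) / len(rewards) if rewards else 0.5
    return mean + temperature * rng.gauss(0.0, 1.0)


def ts_with_predictor(true_means: List[float], horizon: int,
                      predictor: Callable[..., float],
                      seed: int = 0) -> History:
    """Thompson-sampling-style loop: sample one predicted reward per arm,
    pull the arm with the highest sample, then record the feedback."""
    rng = random.Random(seed)
    history: History = []
    n_arms = len(true_means)
    for t in range(horizon):
        temp = temperature_schedule(t)  # high early (explore), low later (exploit)
        samples = [predictor(arm, history, temp, rng) for arm in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulated Bernoulli environment (an assumption for this sketch).
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        history.append((arm, reward))
    return history


if __name__ == "__main__":
    hist = ts_with_predictor([0.2, 0.8], horizon=50,
                             predictor=stub_reward_predictor)
    pulls = [sum(1 for a, _ in hist if a == k) for k in range(2)]
    print("pulls per arm:", pulls)
```

The only source of exploration in this sketch is the randomness of the predictor's output, which is why the temperature starts high and decays. Replacing the stub with a real LLM call would leave the control flow unchanged.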
