LG · Mar 29, 2022

On Kernelized Multi-Armed Bandits with Constraints

arXiv:2203.15589v118.142 citations · h-index: 17

Originality: Incremental advance

AI Analysis

This work addresses kernelized multi-armed bandits with soft constraints and offers a framework for finer trade-offs in practical applications. Its originality is rated as incremental because it extends existing methods to a more general setting.

The paper tackles the problem of stochastic bandits with unknown non-linear reward and constraint functions in a kernelized setting, introducing a primal-dual optimization framework that achieves sublinear regret and constraint violations, and demonstrates superior performance in experiments.

We study a stochastic bandit problem with a general unknown reward function and a general unknown constraint function. Both functions can be non-linear (even non-convex) and are assumed to lie in a reproducing kernel Hilbert space (RKHS) with a bounded norm. This kernelized bandit setup strictly generalizes standard multi-armed bandits and linear bandits. In contrast to safety-type hard constraints studied in prior works, we consider soft constraints that may be violated in any round as long as the cumulative violations are small, which is motivated by various practical applications. Our ultimate goal is to study how to utilize the nature of soft constraints to attain a finer complexity-regret-constraint trade-off in the kernelized bandit setting. To this end, leveraging primal-dual optimization, we propose a general framework for both algorithm design and performance analysis. This framework builds upon a novel sufficient condition, which not only is satisfied under general exploration strategies, including upper confidence bound (UCB), Thompson sampling (TS), and new ones based on random exploration, but also enables a unified analysis for showing both sublinear regret and sublinear or even zero constraint violation. We demonstrate the superior performance of our proposed algorithms via numerical experiments based on both synthetic and real-world datasets. Along the way, we also make the first detailed comparison between two popular methods for analyzing constrained bandits and Markov decision processes (MDPs) by discussing the key difference and some subtleties in the analysis, which could be of independent interest to the communities.
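To make the primal-dual idea in the abstract concrete, here is a minimal sketch of a UCB-style constrained kernelized bandit on a finite arm set. It is an illustration under stated assumptions, not the paper's algorithm: the RBF kernel, the toy functions `f` and `g`, the constraint convention `g(x) <= 0`, the confidence width `beta`, the dual step size `eta`, and the dual cap `lam_max` are all hypothetical choices made for this example.

```python
import numpy as np

def rbf(a, b, ls=0.2):
    """RBF kernel matrix between 1-D point sets a and b (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, arms, noise=0.1):
    """GP posterior mean and std on `arms` given observations (X, y)."""
    if len(X) == 0:
        return np.zeros(len(arms)), np.ones(len(arms))
    X, y = np.asarray(X), np.asarray(y)
    K = rbf(X, X) + noise ** 2 * np.eye(len(X))
    Ks = rbf(arms, X)
    mean = Ks @ np.linalg.solve(K, y)
    # Diagonal of Ks K^{-1} Ks^T, subtracted from the prior variance k(x, x) = 1.
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def run_primal_dual_ucb(T=200, beta=2.0, eta=0.5, lam_max=10.0,
                        noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    arms = np.linspace(0.0, 1.0, 50)
    f = lambda x: np.sin(3 * x)   # toy reward (hypothetical)
    g = lambda x: x - 0.4         # toy constraint, feasible when g(x) <= 0
    X, yf, yg = [], [], []
    lam = 0.0                     # dual variable for the soft constraint
    total_reward, total_g = 0.0, 0.0
    for _ in range(T):
        mu_f, sd_f = gp_posterior(X, yf, arms, noise)
        mu_g, sd_g = gp_posterior(X, yg, arms, noise)
        ucb_f = mu_f + beta * sd_f    # optimistic reward estimate
        lcb_g = mu_g - beta * sd_g    # optimistic constraint estimate
        # Primal step: maximize the optimistic Lagrangian over the arms.
        i = int(np.argmax(ucb_f - lam * lcb_g))
        x = arms[i]
        X.append(x)
        yf.append(f(x) + noise * rng.standard_normal())
        yg.append(g(x) + noise * rng.standard_normal())
        # Dual step: projected ascent, keeping lam inside [0, lam_max].
        lam = float(np.clip(lam + eta * lcb_g[i], 0.0, lam_max))
        total_reward += f(x)
        total_g += g(x)
    # Soft-constraint violation is the positive part of the cumulative g.
    return total_reward, max(total_g, 0.0), lam
```

Calling `run_primal_dual_ucb()` returns the cumulative reward, the cumulative violation, and the final dual variable. Because the violation is the positive part of the sum rather than a per-round quantity, individual rounds may violate the constraint as long as later rounds compensate, which is the soft-constraint behaviour the abstract describes. Swapping the UCB estimates for posterior samples would give a Thompson-sampling-style variant of the same loop.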
