LG DS MLOct 20, 2016

Combinatorial Multi-Armed Bandit with General Reward Functions

Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, Pinyan Lu

arXiv:1610.06603v421.9150 citations

Originality Highly original

AI Analysis

This work addresses a limitation in bandit algorithms for handling complex reward functions, offering theoretical guarantees for applications like optimization and decision-making, though it is incremental in extending existing frameworks.

The paper tackles the stochastic combinatorial multi-armed bandit problem with general nonlinear reward functions, such as max and utility functions, by proposing the SDCB algorithm, which achieves O(log T) distribution-dependent regret and O(√T) distribution-independent regret, and provides the first PTAS for the offline K-MAX problem and O(√T) regret for its online version.

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the $\max()$ function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that SDCB can achieve $O(\log{T})$ distribution-dependent regret and $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $T$ is the time horizon. We apply our results to the $K$-MAX problem and expected utility maximization problems. In particular, for $K$-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first $\tilde{O}(\sqrt T)$ bound on the $(1-ε)$-approximation regret of its online problem, for any $ε>0$.

View on arXiv PDF

Similar