Multi-thresholding Good Arm Identification with Bandit Feedback
This work addresses a multi-objective decision-making problem for bandit algorithms, but it is incremental as it extends existing single-objective methods to multiple thresholds.
The paper tackles the problem of identifying a good arm in a multi-objective stochastic bandit setting, where arms have reward vectors, and proposes the MultiTUCB algorithm with a sample complexity bound that matches existing results in special cases and shows superior performance on synthetic and real datasets.
We consider a good arm identification problem in a stochastic bandit setting with multi-objectives, where each arm $i \in [K]$ is associated with a distribution $D_i$ defined over $R^M$. For each round $t$, the player pulls an arm $i_t$ and receives an $M$-dimensional reward vector sampled according to $D_{i_t}$. The goal is to find, with high probability, an $ε$-good arm whose expected reward vector is larger than $\bmξ - ε\mathbf{1}$, where $\bmξ$ is a predefined threshold vector, and the vector comparison is component-wise. We propose the Multi-Thresholding UCB~(MultiTUCB) algorithm with a sample complexity bound. Our bound matches the existing one in the special case where $M=1$ and $ε=0$. The proposed algorithm demonstrates superior performance compared to baseline approaches across synthetic and real datasets.