Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits
This addresses the exploration challenge in bandit algorithms for decision-making systems, offering a method that generalizes to structured problems, though it is incremental as it builds on existing bootstrapping techniques.
The paper tackles the exploration problem in multi-armed bandits by proposing Giro, an algorithm that randomizes reward history with pseudo rewards for optimistic bootstrapping, achieving an O(K Δ^{-1} log n) regret bound in Bernoulli bandits and showing good performance in synthetic and real-world evaluations.
We propose a bandit algorithm that explores by randomizing its history of rewards. Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards. We design the pseudo rewards such that the bootstrap mean is optimistic with a sufficiently high probability. We call our algorithm Giro, which stands for garbage in, reward out. We analyze Giro in a Bernoulli bandit and derive a $O(K Δ^{-1} \log n)$ bound on its $n$-round regret, where $Δ$ is the difference in the expected rewards of the optimal and the best suboptimal arms, and $K$ is the number of arms. The main advantage of our exploration design is that it easily generalizes to structured problems. To show this, we propose contextual Giro with an arbitrary reward generalization model. We evaluate Giro and its contextual variant on multiple synthetic and real-world problems, and observe that it performs well.