LGMLFeb 26, 2019

Perturbed-History Exploration in Stochastic Multi-Armed Bandits

arXiv:1902.10089v235 citations
AI Analysis

This work addresses a fundamental challenge in online decision-making for multi-armed bandits, offering an incremental improvement over existing methods.

The paper tackles the problem of cumulative regret minimization in stochastic multi-armed bandits by proposing a perturbed-history exploration (PHE) algorithm that adds pseudo-rewards to offset underestimated mean rewards, achieving near-optimal regret bounds and competitive empirical performance with state-of-the-art baselines.

We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds $O(t)$ i.i.d. pseudo-rewards to its history in round $t$ and then pulls the arm with the highest average reward in its perturbed history. Therefore, we call it perturbed-history exploration (PHE). The pseudo-rewards are carefully designed to offset potentially underestimated mean rewards of arms with a high probability. We derive near-optimal gap-dependent and gap-free bounds on the $n$-round regret of PHE. The key step in our analysis is a novel argument that shows that randomized Bernoulli rewards lead to optimism. Finally, we empirically evaluate PHE and show that it is competitive with state-of-the-art baselines.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes