LGMLOct 11, 2019

Old Dog Learns New Tricks: Randomized UCB for Bandit Problems

arXiv:1910.04928v232 citations
Originality Highly original
AI Analysis

This work addresses exploration-exploitation trade-offs in bandit algorithms for machine learning, offering a novel hybrid approach with theoretical guarantees.

The paper tackles the bandit problem by proposing RandUCB, a randomized strategy that combines UCB and Thompson sampling, achieving minimax-optimal regret bounds of Õ(√KT) for multi-armed bandits and Õ(d√T) for structured bandits, and matches or outperforms existing methods in experiments.

We propose $\tt RandUCB$, a bandit strategy that builds on theoretically derived confidence intervals similar to upper confidence bound (UCB) algorithms, but akin to Thompson sampling (TS), it uses randomization to trade off exploration and exploitation. In the $K$-armed bandit setting, we show that there are infinitely many variants of $\tt RandUCB$, all of which achieve the minimax-optimal $\widetilde{O}(\sqrt{K T})$ regret after $T$ rounds. Moreover, for a specific multi-armed bandit setting, we show that both UCB and TS can be recovered as special cases of $\tt RandUCB$. For structured bandits, where each arm is associated with a $d$-dimensional feature vector and rewards are distributed according to a linear or generalized linear model, we prove that $\tt RandUCB$ achieves the minimax-optimal $\widetilde{O}(d \sqrt{T})$ regret even in the case of infinitely many arms. Through experiments in both the multi-armed and structured bandit settings, we demonstrate that $\tt RandUCB$ matches or outperforms TS and other randomized exploration strategies. Our theoretical and empirical results together imply that $\tt RandUCB$ achieves the best of both worlds.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes