LGApr 3, 2017

On Kernelized Multi-armed Bandits

arXiv:1704.00445v2572 citations
AI Analysis

This work addresses the problem of optimizing unknown reward functions in continuous bandit settings for researchers and practitioners, representing an incremental improvement with new algorithms and bounds.

The paper tackles the stochastic bandit problem with a continuous set of arms by proposing two new Gaussian process-based algorithms, IGP-UCB and GP-TS, and derives regret bounds for them, with experimental results showing favorable gains in many cases.

We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for continuous bandit optimization-Improved GP-UCB (IGP-UCB) and GP-Thomson sampling (GP-TS), and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward function belongs to the reproducing kernel Hilbert space (RKHS) that naturally corresponds to a Gaussian process kernel used as input by the algorithms. Along the way, we derive a new self-normalized concentration inequality for vector- valued martingales of arbitrary, possibly infinite, dimension. Finally, experimental evaluation and comparisons to existing algorithms on synthetic and real-world environments are carried out that highlight the favorable gains of the proposed strategies in many cases.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes