LGMLFeb 19, 2021

Output-Weighted Sampling for Multi-Armed Bandits with Extreme Payoffs

arXiv:2102.10085v221 citations
AI Analysis

This addresses bandit optimization for scenarios with extreme payoffs, which is incremental as it builds on existing Gaussian process bandit methods.

The paper tackles the problem of online decision making in multi-armed and contextual bandit problems with extreme payoffs by proposing a novel upper confidence bound acquisition function that guides exploration based on reward variability. The method demonstrates benefits across synthetic benchmarks and a realistic sensor network example.

We present a new type of acquisition functions for online decision making in multi-armed and contextual bandit problems with extreme payoffs. Specifically, we model the payoff function as a Gaussian process and formulate a novel type of upper confidence bound (UCB) acquisition function that guides exploration towards the bandits that are deemed most relevant according to the variability of the observed rewards. This is achieved by computing a tractable likelihood ratio that quantifies the importance of the output relative to the inputs and essentially acts as an \textit{attention mechanism} that promotes exploration of extreme rewards. We demonstrate the benefits of the proposed methodology across several synthetic benchmarks, as well as a realistic example involving noisy sensor network data. Finally, we provide a JAX library for efficient bandit optimization using Gaussian processes.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes