LG MLApr 28, 2023

Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

arXiv:2304.14989v45.33 citationsh-index: 17Has Code

Originality Incremental advance

AI Analysis

This work addresses a specific problem in multi-armed bandit theory for researchers and practitioners, offering an incremental improvement over existing methods.

The paper tackles the challenge of designing regret-efficient randomized exploration algorithms for K-armed bandits with bounded rewards by proposing the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, which achieves an asymptotic optimality for Bernoulli rewards and a worst-case regret bound of O(√(μ*(1-μ*)KT ln K) + K ln T).

We study $K$-armed bandit problems where the reward distributions of the arms are all supported on the $[0,1]$ interval. It has been a challenge to design regret-efficient randomized exploration algorithms in this setting. Maillard sampling \cite{maillard13apprentissage}, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting \cite{bian2022maillard} while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling for achieving KL-style gap-dependent regret bound. We show that KL-MS enjoys the asymptotic optimality when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{μ^*(1-μ^*) K T \ln K} + K \ln T)$, where $μ^*$ is the expected reward of the optimal arm, and $T$ is the time horizon length.

View on arXiv PDF Code

Similar