Achieving adaptivity and optimality for multi-armed bandits using Exponential-Kullback Leibler Maillard Sampling
This work solves a theoretical gap in multi-armed bandit algorithms for researchers and practitioners by providing a unified solution that meets multiple optimality criteria, though it is incremental in combining existing ideas.
The paper tackles the problem of K-armed bandits with exponential family reward distributions by designing the Exp-KL-MS algorithm, which simultaneously achieves asymptotic optimality, minimax optimality with a sqrt(ln(K)) factor, Sub-UCB, and variance-adaptive worst-case regret bounds, addressing a gap where no prior algorithm met all these criteria.
We study the problem of $K$-armed bandits with reward distributions belonging to a one-parameter exponential distribution family. In the literature, several criteria have been proposed to evaluate the performance of such algorithms, including Asymptotic Optimality, Minimax Optimality, Sub-UCB, and variance-adaptive worst-case regret bound. Thompson Sampling-based and Upper Confidence Bound-based algorithms have been employed to achieve some of these criteria. However, none of these algorithms simultaneously satisfy all the aforementioned criteria. In this paper, we design an algorithm, Exponential Kullback-Leibler Maillard Sampling (abbrev. Exp-KL-MS), that can achieve multiple optimality criteria simultaneously, including Asymptotic Optimality, Minimax Optimality with a $\sqrt{\ln (K)}$ factor, Sub-UCB, and variance-adaptive worst-case regret bound.