LG AIJun 15, 2021

Thompson Sampling for Unimodal Bandits

Long Yang, Zhao Li, Zehong Hu, Shasha Ruan, Shijian Li, Gang Pan, Hongyang Chen

arXiv:2106.08187v21.6

Originality Incremental advance

AI Analysis

This work addresses the challenge of efficient exploration in bandit problems with unimodal reward structures, offering significant performance gains over standard methods, though it is incremental in adapting Thompson Sampling to a specific structural assumption.

The authors tackled the problem of improving Thompson Sampling for unimodal bandits by proposing an algorithm that restricts exploration to neighborhoods of the best empirical arm, achieving asymptotically optimal regret for Bernoulli rewards and O(log T) regret for Gaussian rewards.

In this paper, we propose a Thompson Sampling algorithm for \emph{unimodal} bandits, where the expected reward is unimodal over the partially ordered arms. To exploit the unimodal structure better, at each step, instead of exploration from the entire decision space, our algorithm makes decision according to posterior distribution only in the neighborhood of the arm that has the highest empirical mean estimate. We theoretically prove that, for Bernoulli rewards, the regret of our algorithm reaches the lower bound of unimodal bandits, thus it is asymptotically optimal. For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, which is far better than standard Thompson Sampling algorithms. Extensive experiments demonstrate the effectiveness of the proposed algorithm on both synthetic data sets and the real-world applications.

View on arXiv PDF

Similar