Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays
This work addresses the problem of efficiently selecting multiple arms per round in bandit settings for applications like online advertising, though it is incremental as it extends an existing method to a new variant.
The authors tackled the multiple-play multi-armed bandit problem by proposing the MP-TS algorithm, an extension of Thompson sampling, and proved it achieves the optimal regret upper bound that matches the known lower bound for binary rewards, making it the first computationally efficient algorithm with this property.
We discuss a multiple-play multi-armed bandit (MAB) problem in which several arms are selected at each round. Recently, Thompson sampling (TS), a randomized algorithm with a Bayesian spirit, has attracted much attention for its empirically excellent performance, and it is revealed to have an optimal regret bound in the standard single-play MAB problem. In this paper, we propose the multiple-play Thompson sampling (MP-TS) algorithm, an extension of TS to the multiple-play MAB problem, and discuss its regret analysis. We prove that MP-TS for binary rewards has the optimal regret upper bound that matches the regret lower bound provided by Anantharam et al. (1987). Therefore, MP-TS is the first computationally efficient algorithm with optimal regret. A set of computer simulations was also conducted, which compared MP-TS with state-of-the-art algorithms. We also propose a modification of MP-TS, which is shown to have better empirical performance.