LGMay 1, 2015

Thompson Sampling for Budgeted Multi-armed Bandits

arXiv:1505.00146v181 citations
Originality Incremental advance
AI Analysis

This work addresses budget-limited decision-making in sequential resource allocation, representing an incremental extension of Thompson sampling to a more complex setting.

The paper tackles the problem of multi-armed bandits with budget constraints by extending Thompson sampling to handle random costs, proving a regret bound of O(ln B) for Bernoulli distributions and generalizing it to other distributions with similar performance.

Thompson sampling is one of the earliest randomized algorithms for multi-armed bandits (MAB). In this paper, we extend the Thompson sampling to Budgeted MAB, where there is random cost for pulling an arm and the total cost is constrained by a budget. We start with the case of Bernoulli bandits, in which the random rewards (costs) of an arm are independently sampled from a Bernoulli distribution. To implement the Thompson sampling algorithm in this case, at each round, we sample two numbers from the posterior distributions of the reward and cost for each arm, obtain their ratio, select the arm with the maximum ratio, and then update the posterior distributions. We prove that the distribution-dependent regret bound of this algorithm is $O(\ln B)$, where $B$ denotes the budget. By introducing a Bernoulli trial, we further extend this algorithm to the setting that the rewards (costs) are drawn from general distributions, and prove that its regret bound remains almost the same. Our simulation results demonstrate the effectiveness of the proposed algorithm.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes