Batched Thompson Sampling for Multi-Armed Bandits
This work addresses the challenge of efficient decision-making in sequential learning for applications like online advertising or clinical trials, but it is incremental as it builds on existing Thompson Sampling methods.
The paper tackles the problem of minimizing regret in stochastic multi-armed bandits using a batched setting with limited policy changes, proposing two algorithms that achieve almost tight regret-batches tradeoffs for the two-arm case.
We study Thompson Sampling algorithms for stochastic multi-armed bandits in the batched setting, in which we want to minimize the regret over a sequence of arm pulls using a small number of policy changes (or, batches). We propose two algorithms and demonstrate their effectiveness by experiments on both synthetic and real datasets. We also analyze the proposed algorithms from the theoretical aspect and obtain almost tight regret-batches tradeoffs for the two-arm case.