LGMLJun 11, 2020

TS-UCB: Improving on Thompson Sampling With Little to No Additional Computation

arXiv:2006.06372v210 citations
AI Analysis

This work offers a computationally efficient improvement for Thompson sampling in bandit problems, benefiting practitioners in areas like recommendation systems, but it is incremental as it builds on existing methods.

The paper tackles the problem of improving Thompson sampling for online decision problems with bandit feedback by proposing TS-UCB, an alternative arm selection rule that requires little additional computation. It achieves significantly lower regret on synthetic and real-world datasets, including a Yahoo! article recommendation dataset, and provides optimal theoretical regret guarantees.

Thompson sampling has become a ubiquitous approach to online decision problems with bandit feedback. The key algorithmic task for Thompson sampling is drawing a sample from the posterior of the optimal action. We propose an alternative arm selection rule we dub TS-UCB, that requires negligible additional computational effort but provides significant performance improvements relative to Thompson sampling. At each step, TS-UCB computes a score for each arm using two ingredients: posterior sample(s) and upper confidence bounds. TS-UCB can be used in any setting where these two quantities are available, and it is flexible in the number of posterior samples it takes as input. TS-UCB achieves materially lower regret on a comprehensive suite of synthetic and real-world datasets, including a personalized article recommendation dataset from Yahoo! and a suite of benchmark datasets from a deep bandit suite proposed in Riquelme et al. (2018). Finally, from a theoretical perspective, we establish optimal regret guarantees for TS-UCB for both the K-armed and linear bandit models.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes