MLLGMay 18, 2012

Thompson Sampling: An Asymptotically Optimal Finite Time Analysis

arXiv:1205.4217v2616 citations
Originality Highly original
AI Analysis

This resolves a foundational theoretical gap in bandit algorithms, impacting researchers and practitioners in reinforcement learning and decision-making.

The paper tackled the long-standing open problem of Thompson Sampling's optimality for stochastic multi-armed bandits with Bernoulli rewards, providing the first finite-time analysis that matches the asymptotic lower bound for cumulative regret, with numerical experiments comparing it to other optimal policies.

The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that have been lacking in the literature until now for the Bernoulli case.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes