ML LGMay 18, 2012

Thompson Sampling: An Asymptotically Optimal Finite Time Analysis

Emilie Kaufmann, Nathaniel Korda, Rémi Munos

arXiv:1205.4217v2616 citations

Originality Highly original

AI Analysis

This resolves a foundational theoretical gap in bandit algorithms, impacting researchers and practitioners in reinforcement learning and decision-making.

The paper tackled the long-standing open problem of Thompson Sampling's optimality for stochastic multi-armed bandits with Bernoulli rewards, providing the first finite-time analysis that matches the asymptotic lower bound for cumulative regret, with numerical experiments comparing it to other optimal policies.

The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that have been lacking in the literature until now for the Bernoulli case.

View on arXiv PDF

Similar