Asymptotic Convergence of Thompson Sampling
This work addresses the theoretical understanding of Thompson sampling's long-term convergence for researchers in online learning and bandit problems, but it is incremental as it builds on prior finite-time analyses.
The paper tackles the underexplored asymptotic behavior of Thompson sampling by proving that under a sub-linear Bayesian regret assumption, the agent's actions provide a strongly consistent estimator of the optimal action, relying on the martingale structure of the method.
Thompson sampling has been shown to be an effective policy across a variety of online learning tasks. Many works have analyzed the finite time performance of Thompson sampling, and proved that it achieves a sub-linear regret under a broad range of probabilistic settings. However its asymptotic behavior remains mostly underexplored. In this paper, we prove an asymptotic convergence result for Thompson sampling under the assumption of a sub-linear Bayesian regret, and show that the actions of a Thompson sampling agent provide a strongly consistent estimator of the optimal action. Our results rely on the martingale structure inherent in Thompson sampling.