LG | AI | ML | Feb 25, 2016

Thompson Sampling is Asymptotically Optimal in General Environments

arXiv:1602.07905v2 | 39 citations
AI Analysis

The paper provides a theoretical guarantee for Thompson sampling in broad, challenging environments. The contribution is incremental in that it extends prior results to more general settings.

The paper addresses reinforcement learning in general stochastic environments that can be non-Markov, non-ergodic, and partially observable. It shows that a variant of Thompson sampling is asymptotically optimal: its value converges in mean to the optimal value, and, under a recoverability assumption, its regret is sublinear.
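
The two guarantees can be written out in symbols. The following is one common way to formalize them, not a quotation from the paper: the notation is an assumption of this sketch, with $\mu$ the true environment, $\pi$ the agent's policy, $h_{<t}$ the interaction history before step $t$, $V^{\pi}_{\mu}$ the value of $\pi$ in $\mu$, $V^{*}_{\mu}$ the optimal value, and $r_t$ the reward at step $t$.

```latex
% (1) Value convergence in mean (asymptotic optimality in mean)
\mathbb{E}^{\pi}_{\mu}\!\left[ V^{*}_{\mu}(h_{<t}) - V^{\pi}_{\mu}(h_{<t}) \right] \to 0
\quad \text{as } t \to \infty

% (2) Sublinear regret over m steps
R_m(\pi, \mu) = \sup_{\pi'} \mathbb{E}^{\pi'}_{\mu}\!\left[ \sum_{t=1}^{m} r_t \right]
              - \mathbb{E}^{\pi}_{\mu}\!\left[ \sum_{t=1}^{m} r_t \right],
\qquad R_m(\pi, \mu) \in o(m)
```

Read this way, (1) says the expected gap to the optimal value vanishes over time, and (2) says the average per-step loss against the best policy in hindsight goes to zero.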

We discuss a variant of Thompson sampling for nonparametric reinforcement learning in a countable class of general stochastic environments. These environments can be non-Markov, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges to the optimal value in mean and (2) given a recoverability assumption, regret is sublinear.
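
To make the idea of Thompson sampling over a class of environments concrete, here is a minimal toy sketch. It is not the paper's algorithm or setting: it assumes a finite class of two-armed Bernoulli bandits (far simpler than general history-based environments), and the function name `thompson_sampling`, the `horizon` parameter, and the choice to resample from the posterior every `horizon` steps are all illustrative assumptions.

```python
import random


def thompson_sampling(env_class, true_index, steps, horizon=5, seed=0):
    """Toy Thompson sampling over a finite class of Bernoulli bandits.

    env_class:  list of tuples; env_class[i][a] is the success probability
                of arm a in candidate environment i (hypothetical toy model).
    true_index: index of the environment that actually generates rewards.
    horizon:    number of steps to follow one sampled environment's best arm
                before resampling (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(env_class)
    posterior = [1.0 / n] * n  # uniform prior over the class
    true_env = env_class[true_index]
    total_reward = 0
    arm = 0

    for t in range(steps):
        if t % horizon == 0:
            # Sample one candidate environment from the current posterior
            # and commit to the arm that is optimal for that candidate.
            sampled = rng.choices(range(n), weights=posterior)[0]
            means = env_class[sampled]
            arm = max(range(len(means)), key=means.__getitem__)

        # Interact with the true environment.
        reward = 1 if rng.random() < true_env[arm] else 0
        total_reward += reward

        # Bayes update: reweight each candidate by the likelihood it
        # assigns to the observed reward, then renormalize.
        likelihoods = [env[arm] if reward else 1.0 - env[arm] for env in env_class]
        posterior = [p * l for p, l in zip(posterior, likelihoods)]
        z = sum(posterior)
        posterior = [p / z for p in posterior]

    return posterior, total_reward


if __name__ == "__main__":
    envs = [(0.2, 0.8), (0.8, 0.2), (0.5, 0.5)]
    post, total = thompson_sampling(envs, true_index=0, steps=2000)
    print("posterior:", post)
    print("average reward:", total / 2000)
```

In this toy class every arm distinguishes every pair of candidates, so the posterior concentrates on the true environment regardless of which arm is played. General environments offer no such guarantee, which is where assumptions like recoverability enter.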
