LG AI ML, Feb 25, 2016

Thompson Sampling is Asymptotically Optimal in General Environments

Jan Leike, Tor Lattimore, Laurent Orseau, Marcus Hutter

arXiv:1602.07905v212.139 citations, h-index: 33

Originality: Incremental advance

AI Analysis

This paper provides a theoretical guarantee for Thompson sampling in broad, challenging environments. The advance is incremental because it extends prior results to more general settings.

The paper studies reinforcement learning in general stochastic environments that can be non-Markov, non-ergodic, and partially observable. It shows that a variant of Thompson sampling is asymptotically optimal, in the sense that its value converges to the optimal value in mean, and that its regret is sublinear under a recoverability assumption.
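
The two claims can be written out in symbols. The notation here is an assumption, since this summary does not give any: $\pi$ is the Thompson sampling policy, $\mu$ the true environment, $h_{<t}$ the history before step $t$, $V^{\pi}_{\mu}$ the value of $\pi$ in $\mu$, $V^{*}_{\mu}$ the optimal value, and $r_t$ the reward at step $t$. A sketch under those assumptions:

```latex
% (1) Asymptotic optimality in mean: the expected value gap vanishes.
\mathbb{E}^{\pi}_{\mu}\!\left[ V^{*}_{\mu}(h_{<t}) - V^{\pi}_{\mu}(h_{<t}) \right]
  \;\longrightarrow\; 0
  \quad \text{as } t \to \infty .

% (2) Sublinear regret (given recoverability): regret grows slower than m.
R_m(\pi, \mu)
  \;:=\; \sup_{\pi'} \mathbb{E}^{\pi'}_{\mu}\!\left[ \sum_{t=1}^{m} r_t \right]
        - \mathbb{E}^{\pi}_{\mu}\!\left[ \sum_{t=1}^{m} r_t \right]
  \;\in\; o(m) .
```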

We discuss a variant of Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markov, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges to the optimal value in mean and (2) given a recoverability assumption regret is sublinear.
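
To make the idea concrete, here is a minimal sketch of Thompson sampling over a finite class of candidate environments. It is a heavily simplified illustration and not the paper's algorithm: each environment is assumed to be a stateless Bernoulli bandit (a tuple of arm means), the prior is uniform, and the agent resamples an environment at every step. The function name `thompson_sampling` and all parameters are hypothetical.

```python
import random


def thompson_sampling(env_class, true_index, steps, seed=0):
    """Run Thompson sampling over a finite class of Bernoulli bandits.

    env_class:  list of tuples; env_class[i][a] is the mean reward of arm a
                in candidate environment i (assumed strictly between 0 and 1).
    true_index: index of the environment that actually generates rewards.
    Returns the final posterior over env_class and the cumulative regret.
    """
    rng = random.Random(seed)
    n = len(env_class)
    posterior = [1.0 / n] * n          # uniform prior (an assumption)
    true_env = env_class[true_index]
    best_mean = max(true_env)
    regret = 0.0

    for _ in range(steps):
        # 1. Sample one candidate environment from the current posterior.
        sampled = rng.choices(range(n), weights=posterior)[0]
        means = env_class[sampled]

        # 2. Act optimally for the sampled environment.
        arm = max(range(len(means)), key=means.__getitem__)

        # 3. Observe a reward drawn from the true environment.
        reward = 1 if rng.random() < true_env[arm] else 0

        # 4. Bayes update: weight each candidate by the reward's likelihood.
        likelihoods = [env[arm] if reward else 1.0 - env[arm] for env in env_class]
        posterior = [p * l for p, l in zip(posterior, likelihoods)]
        total = sum(posterior)
        posterior = [p / total for p in posterior]

        # Expected regret of this pull relative to the best arm.
        regret += best_mean - true_env[arm]

    return posterior, regret


if __name__ == "__main__":
    candidates = [(0.2, 0.8), (0.8, 0.2), (0.5, 0.5)]
    posterior, regret = thompson_sampling(candidates, true_index=0, steps=2000)
    print("posterior:", posterior)
    print("average regret per step:", regret / 2000)
```

In this toy setting the posterior should concentrate on the true environment, so the average regret per step shrinks as the number of steps grows. The general setting described above is harder because actions can have irreversible consequences, which is why the regret result needs a recoverability assumption.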
