LGMay 19, 2017

Posterior sampling for reinforcement learning: worst-case regret bounds

arXiv:1705.07041v313.238 citations

Originality Incremental advance

AI Analysis

This work provides a theoretical guarantee for posterior sampling in RL, addressing worst-case regret for researchers and practitioners, though it is incremental as it builds on existing bounds.

The paper tackles the problem of reinforcement learning in communicating Markov Decision Processes by developing a posterior sampling algorithm that achieves near-optimal worst-case regret bounds, with a high probability upper bound of $ ilde{O}(DS\sqrt{AT})$ that closely matches the known lower bound of $\Omega(\sqrt{DSAT})$.

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of $\tilde{O}(DS\sqrt{AT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon $T$. This result closely matches the known lower bound of $Ω(\sqrt{DSAT})$. Our techniques involve proving some novel results about the anti-concentration of Dirichlet distribution, which may be of independent interest.

View on arXiv PDF

Similar