ML LGJun 4, 2013

(More) Efficient Reinforcement Learning via Posterior Sampling

Ian Osband, Daniel Russo, Benjamin Van Roy

arXiv:1306.0940v5592 citations

Originality Incremental advance

AI Analysis

This work addresses the challenge of exploration in reinforcement learning for researchers and practitioners, offering a computationally efficient and conceptually simple method that is competitive but incremental compared to optimism-based approaches.

The paper tackles the problem of efficient exploration in reinforcement learning by proposing posterior sampling for reinforcement learning (PSRL), an alternative to optimistic algorithms, and establishes an expected regret bound of $ ilde{O}( au S \sqrt{AT})$, which is close to state-of-the-art and outperforms existing algorithms in simulations.

Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(τS \sqrt{AT})$ bound on the expected regret, where $T$ is time, $τ$ is the episode length and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.

View on arXiv PDF

Similar