LG MLJun 11, 2020

Adaptive Reward-Free Exploration

Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko

arXiv:2006.06294v226.096 citations

Originality Incremental advance

AI Analysis

This work addresses sample efficiency in reinforcement learning for scenarios where rewards are unknown, offering incremental improvements in theoretical bounds.

The paper tackles reward-free exploration in reinforcement learning by proposing an adaptive algorithm, RF-UCRL, that reduces MDP estimation error, achieving a sample complexity of order (SAH^4/ε^2)(log(1/δ) + S) episodes to output an ε-approximation of the optimal policy with probability 1-δ, improving over existing bounds in small ε and δ regimes.

Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order $({SAH^4}/{\varepsilon^2})(\log(1/δ) + S)$ episodes to output, with probability $1-δ$, an $\varepsilon$-approximation of the optimal policy for any reward function. This bound improves over existing sample-complexity bounds in both the small $\varepsilon$ and the small $δ$ regimes. We further investigate the relative complexities of reward-free exploration and best-policy identification.

View on arXiv PDF

Similar