LGMLJun 11, 2020

Adaptive Reward-Free Exploration

arXiv:2006.06294v294 citations
AI Analysis

This work addresses sample efficiency in reinforcement learning for scenarios where rewards are unknown, offering incremental improvements in theoretical bounds.

The paper tackles reward-free exploration in reinforcement learning by proposing an adaptive algorithm, RF-UCRL, that reduces MDP estimation error, achieving a sample complexity of order (SAH^4/ε^2)(log(1/δ) + S) episodes to output an ε-approximation of the optimal policy with probability 1-δ, improving over existing bounds in small ε and δ regimes.

Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order $({SAH^4}/{\varepsilon^2})(\log(1/δ) + S)$ episodes to output, with probability $1-δ$, an $\varepsilon$-approximation of the optimal policy for any reward function. This bound improves over existing sample-complexity bounds in both the small $\varepsilon$ and the small $δ$ regimes. We further investigate the relative complexities of reward-free exploration and best-policy identification.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes