LGAug 11, 2021

Gap-Dependent Unsupervised Exploration for Reinforcement Learning

Jingfeng Wu, Vladimir Braverman, Lin F. Yang

arXiv:2108.05439v29.213 citations

Originality Highly original

AI Analysis

This work addresses the challenge of overly pessimistic sample bounds in unsupervised RL for researchers and practitioners, offering a more efficient approach when reward sub-optimality gaps are present, though it is incremental as it builds on existing worst-case methods by incorporating structural information.

The paper tackles the problem of task-agnostic reinforcement learning by developing an efficient algorithm that uses a gap parameter to reduce exploration, achieving a sample complexity of $\widetilde{\mathcal{O}}(1/\varepsilon \cdot (H^3SA / ho + H^4 S^2 A))$ episodes for obtaining an $\varepsilon$-optimal policy, which is nearly quadratic saving in terms of $\varepsilon$ compared to the worst-case bound of $\widetilde{\mathcal{O}}(1/\varepsilon^2)$.

For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenarios, in which no structural information of the reward/transition-dynamics is utilized. Therefore the best sample upper bound is $\propto\widetilde{\mathcal{O}}(1/ε^2)$, where $ε>0$ is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, $ρ>0$, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}} (1/ε\cdot (H^3SA / ρ+ H^4 S^2 A) )$ episodes of exploration, and is able to obtain an $ε$-optimal policy for a post-revealed reward with sub-optimality gap at least $ρ$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon, obtaining a nearly \emph{quadratic saving} in terms of $ε$. We show that, information-theoretically, this bound is nearly tight for $ρ< Θ(1/(HS))$ and $H>1$. We further show that $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.

View on arXiv PDF

Similar