LG OC MLOct 1, 2020

Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

arXiv:2010.00587v320.251 citations

Originality Highly original

AI Analysis

This provides a nearly optimal algorithm for reinforcement learning in discounted MDPs, addressing a core theoretical problem in the field.

The paper tackles the reinforcement learning problem for discounted Markov Decision Processes by proposing a model-based algorithm, UCBVI-γ, which achieves an Õ(√SAT/(1-γ)^1.5) regret, and proves a matching lower bound, showing it is nearly minimax optimal.

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$γ$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$γ$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-γ)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $γ$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tildeΩ\big({\sqrt{SAT}}/{(1-γ)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$γ$ is nearly minimax optimal for discounted MDPs.

View on arXiv PDF

Similar