LG AI MLMay 27, 2019

Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, Shie Mannor

arXiv:1905.11527v221.774 citationsHas Code

Originality Incremental advance

AI Analysis

This provides a more efficient algorithm for RL practitioners by eliminating the need for full-planning in finite-state finite-horizon MDPs, though it is incremental as it builds on existing methods.

The paper tackles the computational inefficiency of full-planning in model-based reinforcement learning by showing that greedy policies using 1-step planning achieve tight minimax regret of $ ilde{\mathcal{O}}(\sqrt{HSAT})$ without performance degradation, reducing computational complexity by a factor of $S$.

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing \emph{full-planning} on Markov Decision Processes (MDPs) built by the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with \emph{greedy policies} -- act by \emph{1-step planning} -- can achieve tight minimax performance in terms of regret, $\tilde{\mathcal{O}}(\sqrt{HSAT})$. Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of $S$. The results are based on a novel analysis of real-time dynamic programming, then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to such that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.

View on arXiv PDF Code

Similar