LG ML, Jun 12, 2019

Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

arXiv:1906.05110v321.778 citations

Originality: Highly original

AI Analysis

This paper provides a near-optimal algorithm for regret minimization in RL; the contribution is incremental but important for theoretical RL research.

The paper tackles the problem of regret minimization in reinforcement learning for Markov decision processes with finite state-action spaces, achieving a regret bound of $\tilde{O}(\sqrt{SAHT})$, which improves on the previous best by a factor of $\sqrt{S}$ and matches the lower bound up to logarithmic factors.

We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle that efficiently solves Reinforcement Learning (RL) problems modeled as a Markov decision process (MDP) with a finite state-action space. By evaluating the state-pair differences of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored.} for an MDP with $S$ states and $A$ actions, provided that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$, is known. This result improves on the best previous regret bound of $\tilde{O}(S\sqrt{AHT})$ \citep{fruit2019improved} by a factor of $\sqrt{S}$. Furthermore, it matches the lower bound of $\Omega(\sqrt{SAHT})$ \citep{jaksch2010near} up to a logarithmic factor. As a consequence, we obtain a near-optimal regret bound of $\tilde{O}(\sqrt{SADT})$ for MDPs with a finite diameter $D$, compared to the lower bound of $\Omega(\sqrt{SADT})$ \citep{jaksch2010near}.
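The two comparisons in the abstract can be checked with a short calculation. The sketch below ignores constants and logarithmic factors. Its second line relies on the standard fact that $sp(h^{*}) \le D$ for an MDP with finite diameter $D$; the abstract does not state this step, so it is an assumption here.

```latex
\begin{align*}
\frac{S\sqrt{AHT}}{\sqrt{SAHT}} &= \frac{S}{\sqrt{S}} = \sqrt{S}, \\
sp(h^{*}) \le D \;&\Longrightarrow\; H = D \text{ is a valid span bound, so }
\tilde{O}(\sqrt{SAHT}) \text{ becomes } \tilde{O}(\sqrt{SADT}).
\end{align*}
```

For example, with $S = 100$ states the previous bound is larger than the new one by a factor of $\sqrt{100} = 10$, up to the ignored terms.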
