LGMLAug 6, 2018

Regret Bounds for Reinforcement Learning via Markov Chain Concentration

arXiv:1808.01813v248 citations
Originality Highly original
AI Analysis

This provides the first optimal regret bounds for general, non-episodic reinforcement learning, addressing a key theoretical gap for researchers in RL theory.

The paper tackles the problem of deriving regret bounds for reinforcement learning in non-episodic, uniformly ergodic Markov decision processes, achieving a regret bound of $ ilde{O}(\sqrt{t_{ m mix} SAT})$ after $T$ steps, which is optimal in dependence on all parameters.

We give a simple optimistic algorithm for which it is easy to derive regret bounds of $\tilde{O}(\sqrt{t_{\rm mix} SAT})$ after $T$ steps in uniformly ergodic Markov decision processes with $S$ states, $A$ actions, and mixing time parameter $t_{\rm mix}$. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes