LGMar 24, 2022

Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

arXiv:2203.12922v231 citationsh-index: 49
AI Analysis

This solves a fundamental problem in reinforcement learning theory by enabling efficient learning without horizon limitations, which is incremental but impactful for theoretical and practical applications in sequential decision-making.

The paper tackles the problem of achieving horizon-free regret bounds in tabular Markov Decision Processes, presenting the first polynomial-time algorithm that achieves an O(poly(S,A,log K)√K) regret without dependency on the planning horizon H, improving over prior bounds with polylog(H) or exponential dependencies on S.

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound \emph{independent on the planning horizon}. Specifically, we consider tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds which either has an additional $\mathrm{polylog}(H)$ dependency~\citep{zhang2020reinforcement} or has an exponential dependency on $S$~\citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes