Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies
This solves a fundamental problem in reinforcement learning theory by enabling efficient learning without horizon limitations, which is incremental but impactful for theoretical and practical applications in sequential decision-making.
The paper tackles the problem of achieving horizon-free regret bounds in tabular Markov Decision Processes, presenting the first polynomial-time algorithm that achieves an O(poly(S,A,log K)√K) regret without dependency on the planning horizon H, improving over prior bounds with polylog(H) or exponential dependencies on S.
This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound \emph{independent on the planning horizon}. Specifically, we consider tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds which either has an additional $\mathrm{polylog}(H)$ dependency~\citep{zhang2020reinforcement} or has an exponential dependency on $S$~\citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.