LGMLDec 21, 2021

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

arXiv:2112.10935v315 citations
Originality Highly original
AI Analysis

This addresses a foundational problem in RL theory by bridging the gap between existing policy optimization methods and information-theoretic lower bounds, with potential broad impact on algorithm design and analysis.

The paper tackles the theoretical gap in policy optimization methods for reinforcement learning by proposing a novel algorithm, RPO-SAT, which achieves a regret bound of ˜O(√(SAH^3K) + √(AH^4K)), making it nearly minimax optimal when S > H and the first computationally efficient, nearly optimal policy-based algorithm for tabular RL.

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result of policy-based method in \citet{shani2020optimistic} is only $\tilde{O}(\sqrt{S^2AH^4K})$ where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes, and there is a $\sqrt{SH}$ gap compared with the information theoretic lower bound $\tildeΩ(\sqrt{SAH^3K})$. To bridge such a gap, we propose a novel algorithm Reference-based Policy Optimization with Stable at Any Time guarantee (\algnameacro), which features the property "Stable at Any Time". We prove that our algorithm achieves $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret. When $S > H$, our algorithm is minimax optimal when ignoring logarithmic factors. To our best knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes