LGOCMLDec 12, 2019

Provably Efficient Exploration in Policy Optimization

arXiv:1912.05830v4304 citations
AI Analysis

This provides a provably efficient algorithm for policy optimization with exploration, addressing a gap in theory for RL practitioners.

The paper tackles the lack of theoretical understanding for policy-based reinforcement learning by proposing OPPO, an optimistic variant of Proximal Policy Optimization, which achieves a regret bound of $ ilde{O}(\sqrt{d^2 H^3 T})$ in episodic Markov decision processes with linear function approximation.

While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an ``optimistic version'' of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T} )$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes