LG · OC · ML · Jun 25, 2019

Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy

arXiv:1906.10306v3 · 114 citations
Originality: Incremental advance
AI Analysis

The paper provides a theoretical foundation for widely used deep RL algorithms, narrowing a gap between theory and practice. The advance is incremental because it builds on existing methods with modifications.

The paper addresses the limited theoretical understanding of global convergence for PPO and TRPO in deep reinforcement learning, a difficulty that stems from nonconvexity. It proves that a variant with overparametrized neural networks converges to the globally optimal policy at a sublinear rate.

Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning. However, due to nonconvexity, the global convergence of PPO and TRPO remains less understood, which separates theory from practice. In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. The key to our analysis is the global convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are instantiated by neural networks. In particular, the desirable representation power and optimization geometry induced by the overparametrization of such neural networks allow them to accurately approximate the infinite-dimensional gradient and iterate.
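The mirror descent view mentioned in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's algorithm: the paper works with neural-network actors and critics and an infinite-dimensional analysis, whereas the code below assumes a tiny, hypothetical 2-state, 2-action MDP (`NEXT`, `REWARD`, and `GAMMA` are made up for illustration), evaluates Q exactly, and applies the standard KL-proximal update in which the new policy is proportional to the old policy times `exp(step_size * Q)`.

```python
import math

GAMMA = 0.9  # discount factor (assumed for this toy example)

# Hypothetical deterministic MDP: NEXT[s][a] is the next state,
# REWARD[s][a] is the immediate reward. Action 1 is optimal in both states.
NEXT = [[0, 1], [0, 1]]
REWARD = [[0.0, 1.0], [0.0, 2.0]]


def q_values(policy, sweeps=500):
    """Evaluate Q^pi by repeated Bellman expectation backups."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(sweeps):
        v = [sum(policy[s][a] * q[s][a] for a in range(2)) for s in range(2)]
        q = [[REWARD[s][a] + GAMMA * v[NEXT[s][a]] for a in range(2)]
             for s in range(2)]
    return q


def mirror_descent_step(policy, q, step_size):
    """KL-proximal update: pi_new(a|s) is proportional to pi(a|s) * exp(step_size * Q(s, a))."""
    new_policy = []
    for s in range(2):
        q_max = max(q[s])  # subtract the max before exponentiating, for stability
        weights = [policy[s][a] * math.exp(step_size * (q[s][a] - q_max))
                   for a in range(2)]
        total = sum(weights)
        new_policy.append([w / total for w in weights])
    return new_policy


def run(iterations=50, step_size=1.0):
    """Start from the uniform policy and iterate evaluate-then-update."""
    policy = [[0.5, 0.5], [0.5, 0.5]]
    for _ in range(iterations):
        policy = mirror_descent_step(policy, q_values(policy), step_size)
    return policy


final_policy = run()
```

In this toy setting the iterates concentrate on action 1 in both states, which is the optimal policy here. The paper's contribution is showing that a comparable guarantee survives when both the policy and the action-value function are approximated by overparametrized neural networks.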

