LGAIOCJan 30, 2021

Policy Mirror Descent for Reinforcement Learning: Linear Convergence, New Sampling Complexity, and Generalized Problem Classes

arXiv:2102.00135v6184 citations
Originality Highly original
AI Analysis

This work provides faster and more flexible algorithms for reinforcement learning practitioners, though it is incremental as it builds on existing mirror descent techniques.

The paper tackles reinforcement learning problems by introducing policy mirror descent methods with convex regularizers, achieving linear convergence and new sampling complexities of O(1/ε) for strongly convex and O(1/ε²) for general convex cases.

We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall highly nonconvex problems we show that the PMD methods exhibit fast linear rate of convergence to the global optimality. We develop stochastic counterparts of these methods, and establish an ${\cal O}(1/ε)$ (resp., ${\cal O}(1/ε^2)$) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where $ε$ denote the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by ${\cal O}\{(\log_γε) [(1-γ)L/μ]^{1/2}\log (1/ε)\}$ (resp., ${\cal O} \{(\log_γε) (L/ε)^{1/2}\}$)for problems with strongly (resp., general) convex regularizers. Here $γ$ denotes the discounting factor. To the best of our knowledge, these complexity bounds, along with our algorithmic developments, appear to be new in both optimization and RL literature. The introduction of these convex regularizers also greatly expands the flexibility and applicability of RL models.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes