LG, ML, Jun 25, 2019

Policy Optimization with Stochastic Mirror Descent

arXiv:1906.10462v5, 38 citations
Originality: Highly original
AI Analysis

This work addresses sample efficiency in reinforcement learning. It offers an incremental improvement: it matches the best known sample complexity while reporting better empirical performance.

The paper tackles the problem of sample efficiency in reinforcement learning by proposing the VRMPO algorithm, which combines stochastic mirror descent with a novel variance-reduced policy gradient estimator. VRMPO needs O(ε^{-3}) sample trajectories to reach an ε-approximate first-order stationary point, and it outperforms state-of-the-art policy gradient methods in the paper's experiments.

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the $\mathtt{VRMPO}$ algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In $\mathtt{VRMPO}$, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive experimental results demonstrate that $\mathtt{VRMPO}$ outperforms state-of-the-art policy gradient methods in various settings.
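To make the two ingredients named in the abstract concrete, the following is a minimal sketch of stochastic mirror descent driven by a variance-reduced gradient estimator. It is not the paper's $\mathtt{VRMPO}$ algorithm. It assumes a negative-entropy mirror map on the probability simplex, a recursive estimator that corrects the previous estimate with a mini-batch gradient difference, and a toy least-squares objective in place of a policy objective. All function names and parameters (`sample_grad`, `mirror_step`, `vr_mirror_descent`, `epochs`, `inner`, `batch`, `lr`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def sample_grad(x, A, b, idx):
    """Mini-batch gradient of 0.5 * mean((A x - b)^2) over the rows in idx."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)


def mirror_step(x, g, lr):
    """Mirror descent step for the negative-entropy mirror map (assumed here).

    On the simplex this reduces to the exponentiated-gradient update
    x_new proportional to x * exp(-lr * g).
    """
    logits = np.log(np.maximum(x, 1e-300)) - lr * g
    logits -= logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def vr_mirror_descent(A, b, epochs=20, inner=10, batch=8, lr=0.1, seed=0):
    """Toy stochastic mirror descent with a recursive variance-reduced estimator."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.full(d, 1.0 / d)  # start at the uniform distribution
    for _ in range(epochs):
        # Anchor step: full-batch gradient resets the estimator.
        g = sample_grad(x, A, b, np.arange(n))
        x_prev, x = x, mirror_step(x, g, lr)
        for _ in range(inner):
            # Recursive correction: evaluate the SAME mini-batch at the
            # current and previous iterate and add the difference.
            idx = rng.choice(n, size=batch, replace=False)
            g = g + sample_grad(x, A, b, idx) - sample_grad(x_prev, A, b, idx)
            x_prev, x = x, mirror_step(x, g, lr)
    return x
```

Calling `vr_mirror_descent(A, b)` with a data matrix `A` of shape `(n, d)` and targets `b` of shape `(n,)` returns a point on the `d`-dimensional simplex. In a policy-gradient setting the mini-batch would be a batch of sampled trajectories, and evaluating the gradient at the previous iterate typically requires an importance-weighting correction, which this sketch omits.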

