Policy Optimization with Stochastic Mirror Descent
This work addresses sample efficiency in reinforcement learning. It is an incremental improvement: the method matches the best known sample complexity for policy optimization while reporting stronger empirical performance than existing policy gradient methods.
The paper tackles sample efficiency in reinforcement learning by proposing VRMPO, a policy gradient algorithm that combines stochastic mirror descent with a novel variance-reduced policy gradient estimator. VRMPO needs O(ε^{-3}) sample trajectories to reach an ε-approximate first-order stationary point, and it outperforms state-of-the-art policy gradient methods in the reported experiments.
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the $\mathtt{VRMPO}$ algorithm, a sample-efficient policy gradient method based on stochastic mirror descent. $\mathtt{VRMPO}$ introduces a novel variance-reduced policy gradient estimator to improve sample efficiency. We prove that $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to reach an $\epsilon$-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive experiments demonstrate that $\mathtt{VRMPO}$ outperforms state-of-the-art policy gradient methods in various settings.
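
The two ingredients named in the abstract, a mirror descent update and a variance-reduced gradient estimator, can be illustrated with a minimal sketch. This is not the paper's $\mathtt{VRMPO}$ algorithm or its estimator. It assumes a toy softmax bandit, the Euclidean mirror map $\psi(x) = \frac{1}{2}\|x\|^2$ (under which the mirror step reduces to a plain gradient step), and a generic SARAH-style recursive estimator with importance weights. All function names, batch sizes, and step sizes are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())  # subtract the max for numerical stability
    return z / z.sum()

def pg_estimate(theta, actions, rewards, weights=None):
    """Score-function (REINFORCE) gradient estimate for a softmax bandit policy."""
    pi = softmax(theta)
    score = np.eye(theta.size)[actions] - pi  # grad_theta log pi(a), one row per sample
    if weights is None:
        weights = np.ones(len(actions))
    return (weights[:, None] * rewards[:, None] * score).mean(axis=0)

def mirror_step(theta, grad, lr):
    """Mirror ascent step. With psi(x) = 0.5 * ||x||^2 (assumed here) it is a
    gradient step; another mirror map would change only this function."""
    return theta + lr * grad

def sample(theta, batch, mean_rewards, rng):
    """Draw actions from the current policy and observe noisy rewards."""
    actions = rng.choice(theta.size, size=batch, p=softmax(theta))
    rewards = mean_rewards[actions] + 0.1 * rng.standard_normal(batch)
    return actions, rewards

def run(mean_rewards, iters=200, epoch_len=10, big=256, small=16, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(mean_rewards.size)
    theta_prev, g = theta.copy(), np.zeros_like(theta)
    for t in range(iters):
        if t % epoch_len == 0:
            # Start of an epoch: reset the estimator with a large batch.
            a, r = sample(theta, big, mean_rewards, rng)
            g = pg_estimate(theta, a, r)
        else:
            # Recursive correction on a small batch drawn from the current policy.
            # Importance weights re-weight those samples for the previous policy.
            a, r = sample(theta, small, mean_rewards, rng)
            w = softmax(theta_prev)[a] / softmax(theta)[a]
            g = g + pg_estimate(theta, a, r) - pg_estimate(theta_prev, a, r, w)
        theta_prev, theta = theta, mirror_step(theta, g, lr)
    return theta

theta = run(np.array([0.1, 0.5, 0.9]))
print(softmax(theta))  # action probabilities after training
```

The large-batch reset at the start of each epoch keeps the recursive estimate from drifting, while the small-batch corrections in between are what save samples. The importance weights are needed because every batch is drawn from the current policy, not the previous one.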