Stochastic Variance-Reduced Policy Gradient
This work provides a more efficient algorithm for reinforcement learning practitioners, though the contribution is incremental, since it adapts an existing supervised-learning technique (SVRG) to a new domain.
The paper tackles the challenge of adapting stochastic variance-reduced gradient methods to policy gradient reinforcement learning. It addresses a non-concave objective, an approximate full gradient, and non-stationary sampling, and the result is the SVRPG algorithm, which comes with convergence guarantees and a convergence rate that is linear under increasing batch sizes.
In this paper, we propose a novel reinforcement-learning algorithm consisting in a stochastic variance-reduced version of policy gradient for solving Markov Decision Processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to policy gradient is not straightforward and needs to account for I) a non-concave objective function; II) approximations in the full gradient computation; and III) a non-stationary sampling process. The result is SVRPG, a stochastic variance-reduced policy gradient algorithm that leverages importance weights to preserve the unbiasedness of the gradient estimate. Under standard assumptions on the MDP, we provide convergence guarantees for SVRPG with a convergence rate that is linear under increasing batch sizes. Finally, we suggest practical variants of SVRPG, and we empirically evaluate them on continuous MDPs.
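To make the role of the importance weights concrete, here is a minimal sketch of an SVRG-style policy gradient loop on a toy problem. It is not the paper's implementation. The one-step "MDP" with a Gaussian policy, the quadratic reward, all function names, and all hyperparameters are assumptions chosen for illustration, and the update follows the generic SVRG pattern (snapshot gradient plus an importance-weighted correction), which may differ in detail from the estimator analysed in the paper.

```python
import math
import random

SIGMA = 1.0   # fixed policy standard deviation (assumption)
TARGET = 2.0  # action at which the toy reward peaks (assumption)


def reward(a):
    # Hypothetical reward for the one-step toy problem.
    return -(a - TARGET) ** 2


def grad_estimate(a, theta):
    # REINFORCE-style single-sample gradient:
    # r(a) * d/dtheta log N(a; theta, SIGMA^2).
    return reward(a) * (a - theta) / SIGMA ** 2


def importance_weight(a, theta_snap, theta_cur):
    # p(a | theta_snap) / p(a | theta_cur) for Gaussians sharing SIGMA.
    log_w = ((a - theta_cur) ** 2 - (a - theta_snap) ** 2) / (2 * SIGMA ** 2)
    return math.exp(log_w)


def svrpg_toy(theta0=0.0, epochs=50, inner_steps=10,
              n_snapshot=1000, batch=50, step_size=0.02, seed=0):
    rng = random.Random(seed)
    theta = theta0
    for _ in range(epochs):
        # Snapshot: large-batch (approximate) full gradient at theta_snap.
        theta_snap = theta
        mu = sum(
            grad_estimate(rng.gauss(theta_snap, SIGMA), theta_snap)
            for _ in range(n_snapshot)
        ) / n_snapshot
        for _ in range(inner_steps):
            # Samples come from the CURRENT policy (non-stationary sampling),
            # so the snapshot term is reweighted to keep the estimate unbiased.
            correction = 0.0
            for _ in range(batch):
                a = rng.gauss(theta, SIGMA)
                w = importance_weight(a, theta_snap, theta)
                correction += grad_estimate(a, theta) - w * grad_estimate(a, theta_snap)
            v = mu + correction / batch
            theta += step_size * v  # gradient ascent on expected reward
    return theta
```

Calling `svrpg_toy()` should drive the policy mean toward `TARGET`. Because the samples in the inner loop are drawn from the current policy rather than the snapshot policy, the weight `w` is what makes the expected value of the weighted snapshot term equal the snapshot gradient, so the correction has the current gradient minus the snapshot gradient as its mean.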