Policy Gradient using Weak Derivatives for Reinforcement Learning
This work addresses the high variance of policy gradient methods in continuous reinforcement learning. The proposed approach is incremental, but it provides concrete improvements in sample complexity and performance.
This paper tackles policy search in continuous reinforcement learning by establishing an alternative policy gradient theorem based on weak derivatives. The resulting gradient estimates are unbiased and have lower variance than those of the score-function approach, and the method shows superior performance on the OpenAI Gym pendulum environment.
This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results: (i) an alternative policy gradient theorem is established that uses weak (measure-valued) derivatives instead of the score function; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and shown to be $O(1/\sqrt{k})$; (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than that of the estimates obtained using the popular score-function approach. Experiments on the OpenAI Gym pendulum environment show superior performance of the proposed algorithm.
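To make the contrast between the two estimators concrete, the following is a minimal sketch on a one-step toy problem, not the algorithm from the paper. It assumes a Gaussian "policy" $N(\mu, \sigma^2)$ and a hypothetical test function $f(x) = x^2$ standing in for the Q-function, and estimates $\frac{d}{d\mu} E[f(x)] = 2\mu$ in two ways. The score-function estimator weights $f(x)$ by $(x - \mu)/\sigma^2$. The weak-derivative estimator uses the decomposition of $\partial_\mu N(\mu, \sigma^2)$ into $c\,(p^+ - p^-)$ with $c = 1/(\sigma\sqrt{2\pi})$, where $p^\pm$ is the law of $\mu \pm R$ and $R$ is Rayleigh-distributed with scale $\sigma$. The function names, the constants, and the choice to reuse the same draw $R$ for both branches are assumptions made for illustration.

```python
import numpy as np


def f(x):
    # Hypothetical stand-in for the Q-function; chosen so the true gradient is 2 * mu.
    return x ** 2


def score_function_samples(mu, sigma, n, rng):
    # Per-sample estimates f(x) * d/dmu log N(x; mu, sigma^2), with x ~ N(mu, sigma^2).
    x = rng.normal(mu, sigma, size=n)
    return f(x) * (x - mu) / sigma ** 2


def weak_derivative_samples(mu, sigma, n, rng):
    # Per-sample estimates c * (f(x_plus) - f(x_minus)), where
    # x_plus = mu + R and x_minus = mu - R with R ~ Rayleigh(sigma).
    # Reusing the same R for both branches is an illustrative coupling choice;
    # independent draws would also give an unbiased estimate.
    c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    r = rng.rayleigh(scale=sigma, size=n)
    return c * (f(mu + r) - f(mu - r))


rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 0.5, 200_000

sf = score_function_samples(mu, sigma, n, rng)
wd = weak_derivative_samples(mu, sigma, n, rng)

print("true gradient:        ", 2 * mu)
print("score-function mean:  ", sf.mean(), " variance:", sf.var())
print("weak-derivative mean: ", wd.mean(), " variance:", wd.var())
```

In this toy setting both sample means should land close to $2\mu$, while the per-sample variance of the weak-derivative estimate is noticeably smaller. This only illustrates the flavor of result (iv); it says nothing about the full reinforcement learning setting analyzed in the paper.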