LG AI ML May 9, 2018

Policy Optimization with Second-Order Advantage Information

arXiv:1805.03586v24.15 citations Has Code

Originality Incremental advance

AI Analysis

This work addresses variance reduction in policy gradient estimators for high-dimensional continuous control, an incremental advance relevant to reinforcement learning practitioners.

The paper addresses the large variance of policy gradient estimators, which makes policy optimization difficult in high-dimensional continuous control tasks. It proposes the ASDG estimator and the POSA algorithm, which combine the Rao-Blackwell theorem with control variates to reduce variance, and it reports performance improvements on synthetic settings and MuJoCo tasks.

Policy optimization on high-dimensional continuous control tasks is difficult because of the large variance of policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and Control Variates (CV) into a unified framework to reduce the variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space based on second-order advantage information. POSA captures the quadratic information explicitly and efficiently by using a wide & deep architecture. Empirical studies show that our approach improves performance on high-dimensional synthetic settings and OpenAI Gym's MuJoCo continuous control tasks.
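
The variance issue the abstract describes can be illustrated with a toy example. The sketch below is not the ASDG estimator or POSA. It shows only the general control-variate idea on a one-dimensional Gaussian policy with a constant baseline. The reward function, the policy mean `mu`, the baseline value, and the helper `gradient_samples` are all assumptions made for illustration.

```python
import random
import statistics


def reward(a):
    # Hypothetical reward, chosen so the true gradient is known in closed form.
    return -(a - 3.0) ** 2


def gradient_samples(mu, baseline, n, seed):
    """Per-sample score-function gradient estimates w.r.t. mu for a ~ N(mu, 1).

    Subtracting a constant baseline keeps the estimator unbiased,
    because the score (a - mu) has zero mean under the policy.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        a = rng.gauss(mu, 1.0)
        score = a - mu  # d/dmu log N(a; mu, 1)
        samples.append((reward(a) - baseline) * score)
    return samples


# Same seed, so both estimators see the same sampled actions.
plain = gradient_samples(mu=0.0, baseline=0.0, n=20000, seed=0)
# Assumed baseline: the expected reward at mu = 0, which is -(9 + 1) = -10.
with_cv = gradient_samples(mu=0.0, baseline=-10.0, n=20000, seed=0)

mean_plain = statistics.mean(plain)
mean_cv = statistics.mean(with_cv)
var_plain = statistics.variance(plain)
var_cv = statistics.variance(with_cv)

print("plain:         mean", mean_plain, "variance", var_plain)
print("with baseline: mean", mean_cv, "variance", var_cv)
```

For this toy setup the true gradient is -2(mu - 3) = 6 at mu = 0. Both estimators should average close to that value, and the baseline version should show a clearly smaller per-sample variance. According to the abstract, ASDG goes further by combining control variates with Rao-Blackwellization over a learned factorization of the action space, which this sketch does not attempt.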
