AI LG ML Apr 29, 2020

How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization

arXiv:2004.14309v218.030 citations Has Code

Originality Highly original

AI Analysis

This paper addresses a bottleneck in deterministic-policy actor-critic algorithms for continuous control by offering a critic that is more effective for policy improvement.

The paper addresses how to learn a critic whose gradients are useful for policy optimization in continuous control. It proposes MAGE, a model-based actor-critic algorithm that explicitly learns action-value gradients, and demonstrates its efficiency on MuJoCo tasks in comparison to model-free and model-based state-of-the-art baselines.

Deterministic-policy actor-critic algorithms for continuous control improve the actor by plugging its actions into the critic and ascending the action-value gradient, which is obtained by chaining the actor's Jacobian matrix with the gradient of the critic with respect to input actions. However, instead of gradients, the critic is, typically, only trained to accurately predict expected returns, which, on their own, are useless for policy optimization. In this paper, we propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients, which explicitly learns the action-value gradient. MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning, leading to a critic tailored for policy improvement. On a set of MuJoCo continuous-control tasks, we demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
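The two gradients the abstract refers to can be illustrated with a toy one-dimensional sketch. Everything in it is an assumption made for illustration and is not taken from the paper's code: the linear `policy`, the quadratic `critic`, the hand-written `model` and `reward`, and the constants are hypothetical stand-ins. The first part chains the actor's Jacobian with the critic's action gradient. The second part differentiates a one-step TD error with respect to the action through the dynamics model, which is one plausible reading of "gradient targets"; a central finite difference stands in for backpropagation here.

```python
# Toy 1-D sketch. All functions are hypothetical stand-ins, not the paper's networks.

GAMMA = 0.99  # discount factor (assumed value)


def policy(theta, s):
    """Deterministic linear policy: a = theta * s."""
    return theta * s


def critic(w, s, a):
    """Quadratic critic: Q_w(s, a) = -w * (a - s)^2."""
    return -w * (a - s) ** 2


def critic_action_grad(w, s, a):
    """Analytic dQ/da for the quadratic critic above."""
    return -2.0 * w * (a - s)


def policy_gradient(theta, w, s):
    """Chain rule: dQ/dtheta = (da/dtheta) * (dQ/da), evaluated at a = policy(s)."""
    a = policy(theta, s)
    da_dtheta = s  # Jacobian of the linear policy with respect to theta
    return da_dtheta * critic_action_grad(w, s, a)


def model(s, a):
    """Hypothetical learned dynamics: predicts the next state."""
    return 0.9 * s + 0.1 * a


def reward(s, a):
    """Hypothetical reward function."""
    return -(s ** 2) - 0.1 * a ** 2


def td_error(theta, w, s, a):
    """One-step TD error with the next state predicted by the model."""
    s_next = model(s, a)
    a_next = policy(theta, s_next)
    return reward(s, a) + GAMMA * critic(w, s_next, a_next) - critic(w, s, a)


def td_error_action_grad(theta, w, s, a, eps=1e-5):
    """Central finite difference of the TD error with respect to the action.

    Stands in for backpropagating through model, reward, policy and critic.
    """
    plus = td_error(theta, w, s, a + eps)
    minus = td_error(theta, w, s, a - eps)
    return (plus - minus) / (2.0 * eps)


if __name__ == "__main__":
    theta, w, s = 0.5, 2.0, 1.5
    a = policy(theta, s)
    print("actor gradient via chain rule:", policy_gradient(theta, w, s))
    print("action-gradient of TD error:", td_error_action_grad(theta, w, s, a))
```

In this sketch, a critic trained only to make `td_error` small could still have a poor `critic_action_grad`, while penalizing the magnitude of `td_error_action_grad` targets the quantity the actor update actually consumes.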
