Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies
The paper addresses the challenge of accurate gradient estimation in reinforcement learning for continuous control tasks, though the contribution appears incremental because it builds on existing actor-critic frameworks.
The paper tackles the problem of learning continuous policies in deep reinforcement learning by proposing GProp, a method that learns the gradient of the value function with a temporal-difference based approach and uses a deviator-actor-critic (DAC) model. The results show that GProp is competitive with fully supervised methods on a contextual bandit task and achieves the best performance to date on the octopus arm benchmark.
This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
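To make the three-model split concrete, the following is a minimal sketch of a deviator-actor-critic style update on a toy contextual bandit. It is not the paper's implementation. Everything in it is an assumption made for illustration: linear models stand in for the three neural networks, the reward is a hypothetical negative squared distance to a hidden linear target, there is no bootstrapping (a bandit has no next state), and the function name `run_dac_bandit`, the learning rates, the noise scale, and the 1/sigma^2 scaling of the deviator step are all illustrative choices. The idea shown is that the actor's action is perturbed by noise, the critic and deviator jointly regress the observed reward onto a value estimate plus a gradient-times-perturbation term, and the actor then follows the deviator's gradient estimate.

```python
import numpy as np


def run_dac_bandit(steps=2000, batch=64, sigma=0.3, seed=0):
    """Toy deviator-actor-critic loop on a synthetic contextual bandit.

    All models are linear (an assumption; the paper uses neural networks).
    Returns the actor's mean squared parameter error before and after training.
    """
    rng = np.random.default_rng(seed)
    ds, da = 3, 2                        # state and action dimensions (hypothetical)
    A = rng.normal(size=(da, ds))        # hidden target map: best action is A @ s

    Theta = np.zeros((da, ds))           # actor:    mu(s) = Theta @ s
    W = np.zeros((da, ds))               # deviator: G(s)  = W @ s, estimates d(reward)/d(action)
    v = np.zeros(ds + 1)                 # critic:   Q(s)  = v . [s, 1]
    lr_actor, lr_dev, lr_critic = 0.01, 0.1, 0.05   # illustrative step sizes

    initial_error = float(np.mean((Theta - A) ** 2))

    for _ in range(steps):
        S = rng.normal(size=(batch, ds))            # sampled contexts
        eps = sigma * rng.normal(size=(batch, da))  # exploration noise on the action
        action = S @ Theta.T + eps
        # Hypothetical reward: negative squared distance to the target action.
        reward = -np.sum((action - S @ A.T) ** 2, axis=1)

        feats = np.hstack([S, np.ones((batch, 1))])
        q = feats @ v                               # critic's value estimate
        g = S @ W.T                                 # deviator's gradient estimate

        # Residual of the first-order model: reward ~ Q(s) + <G(s), eps>.
        xi = reward - q - np.sum(g * eps, axis=1)

        # Critic and deviator both descend the squared residual.
        v += lr_critic * (feats.T @ xi) / batch
        W += lr_dev * ((xi[:, None] * eps).T @ S) / (batch * sigma ** 2)

        # Actor ascends the estimated gradient: d mu / d Theta contracted with G(s).
        Theta += lr_actor * (g.T @ S) / batch

    final_error = float(np.mean((Theta - A) ** 2))
    return initial_error, final_error


if __name__ == "__main__":
    before, after = run_dac_bandit()
    print(f"actor error before: {before:.4f}, after: {after:.6f}")
```

In this sketch the actor never sees the target action or the reward's true gradient; it only follows the deviator, so the actor's error can shrink only if the deviator's gradient estimate is accurate. That is the same property the contextual bandit task described above is designed to probe.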