D2 Actor Critic: Diffusion Actor Meets Distributional Critic
This work addresses the challenge of stable and effective model-free RL for complex tasks, though it appears incremental because it builds on existing diffusion-policy and distributional RL methods.
The paper tackles the problem of training expressive diffusion policies online in reinforcement learning by introducing D2AC, which avoids the high variance of typical policy gradients and the complexity of backpropagation through time, and achieves state-of-the-art performance on eighteen hard RL tasks.
We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach.
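The abstract does not spell out how the distributional critic combines distributional RL with clipped double Q-learning. The sketch below shows one plausible reading and should not be taken as the authors' implementation. It assumes a categorical (C51-style) return distribution over fixed atoms and two target critics; for each sample it keeps the target distribution with the lower expected value, then projects the Bellman-shifted atoms back onto the support. The function names, the atom grid, and the per-sample selection rule are all illustrative assumptions.

```python
import numpy as np

def expected_value(probs, support):
    """Mean of a categorical return distribution; probs is (batch, atoms)."""
    return probs @ support

def clipped_double_target(probs1, probs2, support):
    """Assumed fusion rule: per sample, keep the distribution with the lower mean."""
    q1 = expected_value(probs1, support)
    q2 = expected_value(probs2, support)
    use_first = (q1 <= q2)[:, None]
    return np.where(use_first, probs1, probs2)

def project_target(probs, support, reward, discount):
    """Project atoms r + discount * z back onto the fixed support (C51-style)."""
    batch, n_atoms = probs.shape
    v_min, v_max = support[0], support[-1]
    delta = (v_max - v_min) / (n_atoms - 1)
    # Shift and scale every atom, then clip to the range the support can represent.
    tz = np.clip(reward[:, None] + discount[:, None] * support[None, :], v_min, v_max)
    b = (tz - v_min) / delta  # fractional atom index of each shifted atom
    lower = np.clip(np.floor(b), 0, n_atoms - 1).astype(int)
    upper = np.minimum(lower + 1, n_atoms - 1)
    w_upper = b - lower       # share of mass sent to the upper neighbour
    w_lower = 1.0 - w_upper
    out = np.zeros_like(probs)
    rows = np.broadcast_to(np.arange(batch)[:, None], (batch, n_atoms))
    # add.at accumulates correctly when several atoms land on the same index.
    np.add.at(out, (rows, lower), probs * w_lower)
    np.add.at(out, (rows, upper), probs * w_upper)
    return out

def one_hot(index, n_atoms):
    """Hypothetical helper: a distribution with all mass on one atom."""
    p = np.zeros(n_atoms)
    p[index] = 1.0
    return p

support = np.linspace(-10.0, 10.0, 21)  # assumed atom grid with spacing 1.0
probs1 = np.stack([one_hot(12, 21), one_hot(15, 21)])  # critic 1: means 2.0 and 5.0
probs2 = np.stack([one_hot(15, 21), one_hot(12, 21)])  # critic 2: means 5.0 and 2.0
chosen = clipped_double_target(probs1, probs2, support)
reward = np.array([0.5, 0.0])
discount = np.array([1.0, 0.0])  # a zero discount marks a terminal transition
target = project_target(chosen, support, reward, discount)
```

In this toy batch both rows select the distribution whose mean is 2.0. The first row's mass is then split between the two atoms that bracket 2.5, and the second row collapses onto the atom at 0 because the transition is terminal. A critic would typically be trained by minimizing the cross-entropy between its predicted distribution and `target`.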