Diffusion Guidance Is a Controllable Policy Improvement Operator
This work addresses the difficulty of scaling reinforcement learning by integrating it with generative modeling. For AI researchers, the contribution is incremental but practical.
The paper tackles the challenge of scaling reinforcement learning by combining it with scalable generative modeling techniques, showing that diffusion guidance can serve as a controllable policy improvement operator. The resulting CFGRL framework improves on the policies in the data without explicitly learning a value function, and on offline RL tasks its performance increases as the guidance weight is raised.
At the core of reinforcement learning is the idea of learning beyond the performance in the data. However, scaling such systems has proven notoriously tricky. In contrast, techniques from generative modeling have proven remarkably scalable and are simple to train. In this work, we combine these strengths by deriving a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend: increased guidance weighting leads to increased performance. Of particular importance, CFGRL can operate without explicitly learning a value function, allowing us to generalize simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, gaining performance for "free" across the board.
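To make the role of the guidance weight concrete, here is a minimal sketch of the standard classifier-free guidance combination rule, applied to a toy one-dimensional action space. It is an illustration under stated assumptions, not the CFGRL implementation: the function names, the unit-variance Gaussian policies, and the value of `mu` are all hypothetical, and the abstract does not specify the exact conditioning signal or network the authors use.

```python
import math


def guided_score(score_ref: float, score_cond: float, w: float) -> float:
    """Standard classifier-free guidance combination of two scores.

    w = 0 recovers the reference (data) policy, w = 1 recovers the
    conditioned policy, and w > 1 extrapolates past it.
    """
    return score_ref + w * (score_cond - score_ref)


def gaussian_score(a: float, mean: float) -> float:
    """Score d/da log N(a; mean, 1) of a unit-variance Gaussian."""
    return -(a - mean)


def guided_mode(mu: float, w: float) -> float:
    """Mode of the guided toy policy.

    Assumption: the reference policy is N(0, 1) and the conditioned
    policy is N(mu, 1). The guided score is then -(a - w * mu), a line
    with slope -1, so its value at a = 0 equals the location of its zero.
    """
    return guided_score(gaussian_score(0.0, 0.0), gaussian_score(0.0, mu), w)


if __name__ == "__main__":
    mu = 1.5  # hypothetical mean of the conditioned policy
    for w in (0.0, 1.0, 2.0, 4.0):
        print(f"w={w}: guided policy mode = {guided_mode(mu, w)}")
```

In this toy setting the guided policy is again a unit-variance Gaussian whose mean is `w * mu`, so raising the weight moves the policy further in the direction that separates the conditioned policy from the data policy. This is only meant to convey why a guidance weight can act as a dial on the degree of policy improvement. It says nothing about how far the weight can be raised before performance degrades in practice.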