Connections between reinforcement learning with feedback, test-time scaling, and diffusion guidance: An anthology
This work offers theoretical insights for machine learning researchers, but it is incremental, as it mainly clarifies existing connections rather than proposing new methods.
The paper identifies fundamental connections between post-training techniques like reinforcement learning with feedback and test-time scaling, and introduces a resampling approach for alignment in diffusion models without explicit reinforcement learning.
In this note, we reflect on several fundamental connections among widely used post-training techniques. We clarify some intimate connections and equivalences between reinforcement learning with human feedback, reinforcement learning with internal feedback, and test-time scaling (particularly soft best-of-$N$ sampling), while also illuminating intrinsic links between diffusion guidance and test-time scaling. Additionally, we introduce a resampling approach for alignment and reward-directed diffusion models, sidestepping the need for explicit reinforcement learning techniques.
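To make the soft best-of-$N$ sampling mentioned above concrete, the following is a minimal Python sketch of one common formulation: draw $N$ candidates from a base sampler, then resample a single candidate with probability proportional to $\exp(r/\beta)$. The names `soft_best_of_n`, `sample_base`, `reward`, and the temperature `beta` are hypothetical choices made for illustration and are not taken from the note, whose exact formulation may differ.

```python
import math
import random


def soft_best_of_n(sample_base, reward, n, beta, rng):
    """Illustrative soft best-of-N (assumed formulation, not the note's own).

    sample_base: callable rng -> candidate, stands in for the base model.
    reward:      callable candidate -> float, stands in for a reward model.
    n:           number of candidates to draw.
    beta:        temperature; smaller values favor high-reward candidates.
    rng:         a random.Random instance.
    """
    candidates = [sample_base(rng) for _ in range(n)]
    rewards = [reward(c) for c in candidates]
    # Subtract the maximum reward before exponentiating for numerical stability.
    top = max(rewards)
    weights = [math.exp((r - top) / beta) for r in rewards]
    # Resample one candidate with probability proportional to its weight.
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy usage: the "base model" is uniform over {0, 1, 2, 3} and the reward
# is the value itself (both are assumptions chosen only for the demo).
rng = random.Random(0)
draw = soft_best_of_n(
    sample_base=lambda r: r.randrange(4),
    reward=float,
    n=8,
    beta=0.5,
    rng=rng,
)
```

In this sketch, `n = 1` returns a plain sample from the base sampler, while letting `beta` shrink toward zero recovers ordinary (hard) best-of-$N$ selection, so the two parameters interpolate between no steering and greedy reward maximization over the drawn candidates.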