E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
This work addresses a specific bottleneck in aligning flow models with human preferences, representing an incremental improvement over existing methods.
The paper tackles sparse and ambiguous reward signals in reinforcement learning for flow matching models. It proposes E-GRPO, which increases the entropy of SDE sampling steps to improve exploration, and reports effectiveness across different reward settings.
Recent reinforcement learning methods have improved the alignment of flow matching models with human preferences. While stochastic sampling enables exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, while low-entropy steps produce rollouts that are hard to distinguish. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization method that increases the entropy of SDE sampling steps. Since the integration of stochastic differential equations suffers from ambiguous reward signals due to stochasticity from multiple steps, we merge consecutive low-entropy steps into a single high-entropy step for SDE sampling and apply ODE sampling to the remaining steps. Building on this, we introduce a multi-step group-normalized advantage, which computes group-relative advantages among samples sharing the same consolidated SDE denoising step. Experimental results under different reward settings demonstrate the effectiveness of our method.
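As an illustration of the group-normalized advantage idea, the following is a minimal sketch, not the paper's implementation. The abstract does not give a formula, so the sketch assumes the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation) and applies it separately to each group of samples that share a consolidated SDE step. The function name `group_normalized_advantages`, the `step_ids` labels, and the `eps` stabilizer are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def group_normalized_advantages(rewards, step_ids, eps=1e-8):
    """Normalize rewards within groups of samples sharing an SDE step.

    rewards  : list of scalar rewards, one per sampled rollout
    step_ids : list of hashable labels (hypothetical); rollouts with the
               same label are assumed to share the same consolidated
               SDE denoising step
    eps      : small constant (assumption) that avoids division by zero
    """
    if len(rewards) != len(step_ids):
        raise ValueError("rewards and step_ids must have the same length")

    # Collect the indices of the rollouts belonging to each group.
    groups = defaultdict(list)
    for idx, step in enumerate(step_ids):
        groups[step].append(idx)

    advantages = [0.0] * len(rewards)
    for indices in groups.values():
        group_rewards = [rewards[i] for i in indices]
        mean = fmean(group_rewards)
        std = pstdev(group_rewards)  # population std within the group
        for i in indices:
            advantages[i] = (rewards[i] - mean) / (std + eps)
    return advantages


# Two groups of three rollouts each, labelled by their SDE step.
example = group_normalized_advantages(
    rewards=[1.0, 2.0, 3.0, 10.0, 10.0, 10.0],
    step_ids=[0, 0, 0, 1, 1, 1],
)
```

Under this assumed normalization, a group whose rollouts all receive the same reward yields zero advantages, so it gives the policy no preference between them. This is consistent with the abstract's observation that hard-to-distinguish rollouts carry little learning signal.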