CV · LG · May 15

Embedding-perturbed Exploration Preference Optimization for Flow Models

Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu, Xiu Li

arXiv:2605.15803 · 96.71 citations

Predicted impact: top 7% in CV · last 90 days · Originality: Incremental advance

AI Analysis

For researchers working on alignment of generative models using RL, this work tackles a critical bottleneck in group-based optimization, though the improvement is incremental over existing methods.

The paper addresses the problem of variance decay in group-based RL optimization for generative models, which leads to training instability and stagnation. The proposed E²PO method introduces embedding-level perturbations to maintain variance, achieving superior alignment with human preferences over state-of-the-art baselines.

Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization ($E^2$PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
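
The variance-decay problem described in the abstract can be illustrated with a minimal sketch. It assumes the common GRPO-style advantage, in which each reward is standardized against the mean and standard deviation of its own group; the abstract does not give the exact formula used in the paper. The `perturb_embedding` helper is purely hypothetical: it adds isotropic Gaussian noise to a conditioning embedding as a stand-in for the paper's "structured" perturbation, whose actual form is not specified here.

```python
import random
import statistics


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage (assumed form): standardize rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


def perturb_embedding(embedding, scale, rng):
    """Hypothetical stand-in: add isotropic Gaussian noise to an embedding.

    This is NOT the paper's perturbation scheme, only an illustration of
    injecting diversity at the embedding level instead of the noise level.
    """
    return [x + rng.gauss(0.0, scale) for x in embedding]


# A group with distinct rewards yields informative, non-zero advantages.
diverse = group_advantages([0.2, 0.5, 0.9, 0.4])

# A collapsed group (identical rewards) yields all-zero advantages,
# so the policy gradient carries no learning signal.
collapsed = group_advantages([0.7, 0.7, 0.7, 0.7])

# Each group member gets its own perturbed copy of the same base embedding.
rng = random.Random(0)
base_embedding = [0.1, -0.3, 0.8]
group_embeddings = [perturb_embedding(base_embedding, 0.05, rng) for _ in range(4)]
```

In this toy setting, the collapsed group produces zero advantage for every sample, which is the loss of learning signal the abstract refers to. Perturbing the conditioning embedding per sample is one way to keep group members distinct; whether it restores reward variance in practice depends on the model and reward function.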
