CV · Apr 5

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

Liyu Zhang, Kehan Li, Tingrui Han, Tao Zhao, Yuxuan Sheng, Shibo He, Chao Li

arXiv:2604.04142 71.92 citations

AI Analysis

This work addresses the low sample efficiency of GRPO post-training, a problem for researchers and practitioners who fine-tune flow-matching models. It is an incremental improvement over existing on-policy methods such as Flow-GRPO.

The paper tackles the low sample efficiency of GRPO in flow-matching models by introducing OP-GRPO, an off-policy framework that uses a replay buffer and importance sampling. It achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average.

Post-training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO's clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.
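The three components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the selection rule, the exact form of the sequence-level ratio, or the truncation point. The top-reward buffer policy, the length-normalized ratio, the `keep_steps` parameter, and all function and class names below are assumptions made for illustration.

```python
import heapq
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def sequence_ratio(logp_new, logp_old, keep_steps):
    """One importance ratio per trajectory instead of one per denoising step.

    Only the first `keep_steps` steps contribute (requires keep_steps >= 1),
    mimicking truncation of late steps. Averaging the per-step log-ratio
    differences (length normalization) is an assumption, not taken from the paper.
    """
    diffs = [n - o for n, o in zip(logp_new[:keep_steps], logp_old[:keep_steps])]
    return math.exp(sum(diffs) / len(diffs))


def clipped_objective(ratio, advantage, eps=0.2):
    """Standard PPO/GRPO clipped surrogate, applied to the sequence-level ratio."""
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)


class ReplayBuffer:
    """Keeps the `capacity` highest-reward trajectories (hypothetical selection rule)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []   # min-heap of (reward, insertion counter, trajectory)
        self._count = 0   # tie-breaker so trajectories are never compared directly

    def add(self, reward, trajectory):
        item = (reward, self._count, trajectory)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            # Push the new item, then evict whichever entry has the lowest reward.
            heapq.heappushpop(self._heap, item)

    def rewards(self):
        return sorted(r for r, _, _ in self._heap)
```

In this sketch, a stored trajectory keeps the per-step log-probabilities it was sampled under (`logp_old`), so that a later policy can be corrected with `sequence_ratio` before the clipped update. Because the ratio is a single scalar per trajectory, the clipping range `eps` applies once per sample rather than once per denoising step.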
