LG · CV · Jun 1

Drifting Preference Optimization for One-Step Generative Models

arXiv:2606.02521 · 87.7

Predicted impact: top 10% in LG · last 90 days
Originality: Highly original

AI Analysis

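This work addresses the challenge of aligning one-step generative models with human preferences. One-step generation matters for efficient deployment, but standard alignment methods often rely on policy likelihoods, denoising trajectories, differentiable reward gradients, or test-time optimization.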

DrPO enables preference finetuning of one-step text-to-image generators without requiring differentiable rewards or denoising trajectories. It improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by 3.51× under the matched effective-batch setting.

One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denoising trajectories, differentiable reward gradients, or test-time optimization. We propose Drifting Preference Optimization (DrPO), an online preference-finetuning method for deterministic one-step generators. For each prompt, DrPO samples candidates from the current generator, ranks them with a target reward, and uses high- and low-scoring samples to synthesize a feature-space update direction. The update is a non-parametric dipole preference field plus a reference drift estimated from the frozen base generator, and is optimized through a detached feature-space regression target. The target reward is used only for ranking, so DrPO can train with large, black-box, or non-differentiable rewards while inference remains a single generator call. We evaluate DrPO on SD-Turbo and SDXL-Turbo with multiple target rewards and benchmarks, including HPSv3 and GenEval. DrPO improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by $3.51\times$ under the matched effective-batch setting by removing reward-model backpropagation. Initial offline experiments suggest that sample-based gradient synthesis can also be used beyond online reward ranking.
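The per-prompt procedure described in the abstract (sample candidates, rank them with a reward, build a feature-space direction from high- and low-scoring samples, regress onto a fixed target) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the form of the dipole field or the reference drift, so the kernel-weighted attraction and repulsion below, and every name and parameter (`rank_split`, `kernel_pull`, `dipole_target`, `k`, `bandwidth`, `step`, `ref_weight`), are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def rank_split(rewards, k):
    """Indices of the k highest- and k lowest-scoring candidates.
    The reward is used only for ranking, so it may be a black box."""
    order = np.argsort(rewards)  # ascending order of reward
    return order[-k:], order[:k]

def kernel_pull(x, anchors, bandwidth):
    """Kernel-weighted mean displacement from each row of x toward the anchors.
    ASSUMPTION: a Gaussian-kernel, mean-shift-style estimate; the paper's
    actual non-parametric field may differ."""
    diff = anchors[None, :, :] - x[:, None, :]            # (n, m, d)
    logits = -(diff ** 2).sum(-1) / (2 * bandwidth ** 2)  # (n, m)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return (w[:, :, None] * diff).sum(axis=1)             # (n, d)

def dipole_target(feats, rewards, ref_feats, k=1, bandwidth=1.0,
                  step=0.5, ref_weight=0.1):
    """Regression target in feature space for one prompt.
    feats:     (n, d) features of candidates from the current generator
    ref_feats: (m, d) features of samples from the frozen base generator
    ASSUMPTION: field = pull toward winners minus pull toward losers,
    plus a weighted pull toward the reference samples."""
    hi, lo = rank_split(rewards, k)
    field = (kernel_pull(feats, feats[hi], bandwidth)
             - kernel_pull(feats, feats[lo], bandwidth))
    drift = kernel_pull(feats, ref_feats, bandwidth)
    # The result is a constant array: it plays the role of a detached target.
    return feats + step * (field + ref_weight * drift)

def regression_loss(feats, target):
    """Mean squared error against the fixed target."""
    return float(np.mean((feats - target) ** 2))

# Toy example: three candidates on a line, reward increasing to the right.
feats = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
rewards = np.array([0.0, 0.5, 1.0])
target = dipole_target(feats, rewards, ref_feats=feats, ref_weight=0.0)
loss = regression_loss(feats, target)
```

In an autograd framework the target would be computed without gradient tracking, so only the generator's features receive gradients and the reward model is never backpropagated through, which is consistent with the computation saving the abstract attributes to removing reward-model backpropagation.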
