Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

David McAllister, Miika Aittala, Tero Karras, Janne Hellsten, Angjoo Kanazawa, Timo Aila, Samuli Laine

arXiv:2603.12893

AI Analysis

This work addresses a specific bottleneck in RL-based fine-tuning for diffusion models, offering incremental improvements for researchers and practitioners in image synthesis.

The paper targets the high variance of reinforcement learning updates when post-training text-to-image models. It proposes an online RL variant that samples paired trajectories and pulls the flow velocity toward the more favorable image, which yields faster convergence and better output quality and prompt alignment than previous methods.

Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.
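
The paired-sampling idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the constant 2-D velocity `theta`, the quadratic `reward`, the function names, and all hyperparameters are assumptions chosen only to keep the example small. Two samples are drawn per update, each full sampling run is treated as a single action, and the velocity is nudged along the difference between the two outputs, weighted by their reward difference (a finite-difference estimate of the reward gradient).

```python
import numpy as np


def reward(x, target):
    # Hypothetical reward: larger when the sample is closer to the target.
    return -np.sum((x - target) ** 2)


def sample(theta, x0, steps=8):
    # Euler integration of a constant velocity field from t=0 to t=1.
    # The whole loop is one "action"; individual steps are not scored.
    x = x0.copy()
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + theta * dt
    return x


def train(target, iters=2000, lr=0.02, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(target)  # toy stand-in for the model parameters
    for _ in range(iters):
        # Draw a pair of trajectories from independent initial noise.
        x_a = sample(theta, sigma * rng.standard_normal(target.shape))
        x_b = sample(theta, sigma * rng.standard_normal(target.shape))
        # Finite difference of the reward between the two outputs.
        delta_r = reward(x_a, target) - reward(x_b, target)
        # Pull the velocity toward the better sample and away from the worse.
        theta = theta + lr * delta_r * (x_a - x_b)
    return theta
```

Calling `train(np.array([2.0, -1.0]))` should return a velocity close to the target, since in this toy setup the expected update points along the reward gradient. The estimate stays noisy, so the result fluctuates around the target rather than matching it exactly.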
