Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

arXiv:2603.12893 60.91 citations
AI Analysis

This work addresses a specific bottleneck in RL-based fine-tuning for diffusion models, offering incremental improvements for researchers and practitioners in image synthesis.

The paper targets the high variance of reinforcement learning updates when post-training text-to-image models. It proposes an online RL variant that samples paired trajectories and optimizes the flow velocity, which the authors report converges faster and yields better output quality and prompt alignment than previous methods.

Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.
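
To make the paired-trajectory idea concrete, here is a minimal toy sketch in NumPy. It is not the paper's algorithm: the constant velocity field `theta`, the squared-distance reward, the 2-D "images", and the exact update rule (reward difference times endpoint difference) are all assumptions chosen for illustration. It only shows the general shape of the idea: sample two trajectories, score the final samples, and nudge the velocity toward the better one, using just the endpoints so that the whole sampling run acts as a single action.

```python
import numpy as np

def sample(theta, x0, n_steps=8):
    """Euler-integrate dx/dt = theta from t=0 to t=1.

    theta is a hypothetical constant velocity field (a stand-in for a
    learned flow model); x0 has shape (n, d).
    """
    x = x0.copy()
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        x = x + dt * theta
    return x

def reward(x, target):
    """Toy reward (an assumption): negative squared distance to a target."""
    return -np.sum((x - target) ** 2, axis=-1)

def train(target, n_iters=500, n_pairs=32, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(target)
    for _ in range(n_iters):
        # Sample paired trajectories from independent initial noise.
        x0_a = rng.standard_normal((n_pairs, target.size))
        x0_b = rng.standard_normal((n_pairs, target.size))
        x1_a = sample(theta, x0_a)
        x1_b = sample(theta, x0_b)
        # Finite difference of rewards and of final samples within each pair.
        dr = reward(x1_a, target) - reward(x1_b, target)
        dx = x1_a - x1_b
        # Pull the velocity toward the more favorable sample of each pair.
        theta = theta + lr * np.mean(dr[:, None] * dx, axis=0)
    return theta

target = np.array([2.0, -1.0])
theta = train(target)
print("learned velocity:", theta)
```

In this toy setting the expected update points from the current velocity toward the target, so `theta` drifts close to `target`. Differencing the two rewards within a pair cancels their shared baseline, which is one common way to reduce update variance; whether the paper uses this exact form is not stated in the abstract.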
