Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning
This work targets AI researchers and practitioners who face label-intensive, unstable training in video generation. It offers a model-agnostic solution that builds incrementally on existing DPO methods.
The paper tackles the fine-tuning of video diffusion models by proposing Diffusion-DRF, a method that uses a frozen Vision-Language Model as a differentiable critic. It improves video quality and semantic alignment without additional reward models or preference datasets, while mitigating reward hacking and collapse.
Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy to game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We also propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse, all without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.
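To make the idea of turning logit-level VLM responses into gradients more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the critic is asked one yes/no question per aspect, that the reward is the log-probability of the "yes" token, and that aspect rewards are averaged; the abstract does not specify any of these choices. The function names and the example logits are hypothetical. In a real pipeline an autodiff framework would carry this gradient further back through the frozen VLM, the video decoder, and the final denoising steps; here only the last link (reward with respect to the answer logits) is shown, using the standard library.

```python
import math


def yes_log_prob(logits, yes_index=0):
    """Log-softmax of the 'yes' entry: log p(yes | video, question)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[yes_index] - log_z


def yes_log_prob_grad(logits, yes_index=0):
    """Analytic gradient of log p(yes) w.r.t. each logit: onehot(yes) - softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [(1.0 if i == yes_index else 0.0) - e / z for i, e in enumerate(exps)]


def aspect_reward(aspect_logits, yes_index=0):
    """Mean log p(yes) over all aspect questions (assumed aggregation)."""
    total = sum(yes_log_prob(l, yes_index) for l in aspect_logits)
    return total / len(aspect_logits)


# Hypothetical [yes, no] logits for three aspect questions about one video.
aspect_logits = [[2.0, 0.5], [0.1, 1.2], [1.5, 1.5]]

reward = aspect_reward(aspect_logits)
grads = [yes_log_prob_grad(l) for l in aspect_logits]
```

Because the reward is a smooth function of the logits, its gradient is defined at every answer, unlike a binary preference label. Under these assumptions, an aspect on which the critic is already confident contributes a small gradient, while an uncertain or failing aspect contributes a larger one.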