Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances
This work aims to help researchers and practitioners improve visual content generation by systematically reviewing how reinforcement learning methods are integrated into generative models. As a survey, its contribution is incremental: it consolidates existing methods rather than introducing new techniques.
This survey tackles the misalignment between the surrogate objectives used to train visual generative models and the goals of perceptual quality, semantic accuracy, and physical realism. It reviews how reinforcement learning (RL) can optimize non-differentiable and preference-driven objectives to enhance controllability, consistency, and human alignment across image, video, and 3D/4D generation tasks.
Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.