Fake it till You Make it: Reward Modeling as Discriminative Prediction
This work addresses the implementation complexity of reward modeling for visual generative models and offers researchers and practitioners a more efficient approach, though the contribution is incremental because it builds on established adversarial training concepts.
The paper tackles reward modeling for visual generative models by proposing GAN-RM, which removes the need for extensive human-annotated preference data or hand-engineered quality dimensions. It requires only a few hundred target samples and is reported to be effective in applications such as Best-of-N filtering and DPO.
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and ordinary model-generated outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate GAN-RM's effectiveness across multiple key applications, including test-time scaling implemented as Best-of-N sample filtering and post-training approaches such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Code and data will be released at https://github.com/Visualignment/GAN-RM.
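The core idea in the abstract, a reward model trained to discriminate target samples from ordinary generated outputs and then used to rank candidates, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture or training schedule, so the sketch assumes a plain logistic-regression discriminator over toy 2-D feature vectors that stand in for image embeddings. All function names (`train_reward_model`, `best_of_n`, `preference_pair`) and the synthetic data are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_reward_model(target_feats, generated_feats, epochs=200, lr=0.5):
    """Fit a logistic-regression discriminator (assumed stand-in for the
    reward model): target samples are labeled 1, generated samples 0."""
    data = [(x, 1.0) for x in target_feats] + [(x, 0.0) for x in generated_feats]
    dim = len(data[0][0])
    n = len(data)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            # Gradient of binary cross-entropy w.r.t. the logit is (p - y).
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def reward(model, x):
    # Reward = discriminator's probability that x looks like a target sample.
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def best_of_n(model, candidates):
    # Test-time scaling: keep the highest-reward candidate out of N.
    return max(candidates, key=lambda x: reward(model, x))


def preference_pair(model, candidates):
    # Build a (chosen, rejected) pair, the data format DPO-style training uses.
    ranked = sorted(candidates, key=lambda x: reward(model, x))
    return ranked[-1], ranked[0]


rng = random.Random(0)


def sample(mean, count):
    # Synthetic 2-D "embeddings"; purely illustrative data.
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(count)]


target = sample(2.0, 200)      # a few hundred target samples (proxy data)
generated = sample(0.0, 200)   # ordinary generator outputs
model = train_reward_model(target, generated)

candidates = sample(0.0, 16)   # N = 16 fresh generations for one prompt
best = best_of_n(model, candidates)
chosen, rejected = preference_pair(model, candidates)
```

In this sketch the same scalar score serves both uses named in the abstract: `best_of_n` filters samples at test time, while `preference_pair` produces ranked pairs that a DPO-style trainer could consume. A real system would replace the toy features and linear model with a learned image encoder.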