AI · LG · Apr 14

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen

arXiv:2604.11626 · 99.04 citations · h-index: 16 · Has Code

Predicted impact: top 1% in AI · last 90 days
Originality: Highly original

AI Analysis

For practitioners working on text-to-image generation and image editing, this work offers a reward model that is more interpretable than scalar alternatives and that can improve generator outputs at test time without any parameter updates.

The paper introduces RationalRewards, an 8B reward model that produces explicit, multi-dimensional critiques before scoring. This improves visual generation both at training time (structured rationales serve as rewards for RL) and at test time (a Generate-Critique-Refine loop revises prompts). The model achieves state-of-the-art preference prediction among open-source reward models and is competitive with Gemini-2.5-Pro while using 10-20x less training data than comparable baselines. Its test-time loop matches or exceeds RL fine-tuning on several benchmarks.
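The Generate-Critique-Refine loop can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the function names, the `Critique` fields, the stopping threshold, and the toy stand-ins are hypothetical and are not the paper's interface. The only property taken from the summary is that critiques drive prompt revisions and no model parameters are updated.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    score: float         # overall score assigned by the reward model (hypothetical field)
    feedback: List[str]  # critique items to address in the next prompt (hypothetical field)


def generate_critique_refine(
    request: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    refine: Callable[[str, Critique], str],
    max_rounds: int = 3,
    target_score: float = 0.9,
) -> Tuple[str, str, float]:
    """Return the best (prompt, output, score) seen within max_rounds."""
    best: Optional[Tuple[str, str, float]] = None
    prompt = request
    for _ in range(max_rounds):
        output = generate(prompt)
        # The output is always judged against the original request,
        # not against the rewritten prompt.
        result = critique(request, output)
        if best is None or result.score > best[2]:
            best = (prompt, output, result.score)
        if result.score >= target_score:
            break
        prompt = refine(prompt, result)  # only the prompt changes, never the weights
    assert best is not None, "max_rounds must be at least 1"
    return best


# Toy stand-ins so the sketch runs without any real models.
def toy_generate(prompt: str) -> str:
    # Pretend the generator only attends to the first four words.
    return "image of: " + " ".join(prompt.split()[:4])


def toy_critique(request: str, output: str) -> Critique:
    required = sorted({w for w in request.split() if len(w) > 2})
    shown = set(output.split())
    missing = [w for w in required if w not in shown]
    return Critique(score=(len(required) - len(missing)) / len(required), feedback=missing)


def toy_refine(prompt: str, result: Critique) -> str:
    # Move the missing details to the front of the prompt.
    return " ".join(result.feedback) + " " + prompt


best_prompt, best_output, best_score = generate_critique_refine(
    "a cat on a red sofa", toy_generate, toy_critique, toy_refine
)
print(best_prompt, "|", best_output, "|", best_score)
```

In the toy run, the first round misses "red" and "sofa", the critique names them, and the refined prompt recovers them on the second round. In a real setup the generator, the reward model, and the prompt rewriter would replace the three toy functions.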

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
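The abstract names the three PARROT stages but does not describe their mechanics. The sketch below is one plausible reading of the first two stages, not the paper's procedure: a teacher writes a rationale while seeing the known preference label (anchored generation), and the rationale is kept only if a label-blind check recovers the same preference from it (consistency filtering). The record layout, function names, and toy stand-ins are all hypothetical. The surviving records would then serve as distillation data.

```python
from typing import Callable, Dict, List

# Hypothetical record layout: prompt, image_a, image_b, preferred ("a" or "b").
Pair = Dict[str, str]


def filter_consistent_rationales(
    pairs: List[Pair],
    rationalize: Callable[[Pair], str],
    predict_from_rationale: Callable[[Pair, str], str],
) -> List[Pair]:
    """Keep only pairs whose rationale, read without the label, implies that label."""
    kept: List[Pair] = []
    for pair in pairs:
        rationale = rationalize(pair)  # anchored: the teacher sees pair["preferred"]
        predicted = predict_from_rationale(pair, rationale)  # must not read the label
        if predicted == pair["preferred"]:
            kept.append({**pair, "rationale": rationale})
    return kept


# Toy stand-ins so the sketch runs without any real models.
def toy_rationalize(pair: Pair) -> str:
    if pair["prompt"].startswith("abstract"):
        return "Both images look fine."  # uninformative rationale
    return f"Image {pair['preferred'].upper()} matches the prompt better."


def toy_predict(pair: Pair, rationale: str) -> str:
    # Reads only the rationale text, never pair["preferred"].
    if "Image A" in rationale:
        return "a"
    if "Image B" in rationale:
        return "b"
    return "tie"


data = [
    {"prompt": "a red cat", "image_a": "a.png", "image_b": "b.png", "preferred": "b"},
    {"prompt": "abstract shapes", "image_a": "c.png", "image_b": "d.png", "preferred": "a"},
]
kept = filter_consistent_rationales(data, toy_rationalize, toy_predict)
print(len(kept), kept[0]["rationale"])
```

The uninformative rationale is dropped because it does not let the label-blind check recover the annotated preference, so only rationales that actually explain the choice survive.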

