AI · LG · Apr 14

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

arXiv:2604.11626 · 99.04 citations · h-index: 16 · Has Code
Predicted impact: top 1% in AI · last 90 days · Originality: Highly original
AI Analysis

For practitioners of text-to-image and image-editing generation, this work provides a more interpretable and effective reward model that enhances generator performance without requiring additional parameter updates at test time.

The paper introduces RationalRewards, a reward model that produces explicit, multi-dimensional critiques before scoring. This improves visual generation at both training time (structured rationales serve as RL rewards) and test time (a Generate-Critique-Refine loop revises prompts). The model achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, using 10-20x less training data than comparable baselines, and its test-time loop matches or exceeds RL fine-tuning on several benchmarks.
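The test-time loop described above can be pictured roughly as follows. This is a minimal sketch, not the paper's implementation: the `Critique` structure, the dimension names, the mean-score aggregation, the stopping threshold, and the `generate`/`critique` callables are all hypothetical stand-ins for a real generator and reward model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Critique:
    # Per-dimension scores in [0, 1]; the dimension names are assumptions.
    scores: Dict[str, float]
    # Prompt revision proposed by the critique (assumed output of the reward model).
    revised_prompt: str

    @property
    def overall(self) -> float:
        # Assumption: aggregate dimensions with a plain mean.
        return sum(self.scores.values()) / len(self.scores)


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    max_rounds: int = 3,
    target: float = 0.9,
) -> Tuple[str, float]:
    """Run generate -> critique -> refine without touching generator weights."""
    best_output, best_score = "", float("-inf")
    current_prompt = prompt
    for _ in range(max_rounds):
        output = generate(current_prompt)
        # Always judge against the ORIGINAL request, not the revised prompt.
        feedback = critique(prompt, output)
        if feedback.overall > best_score:
            best_output, best_score = output, feedback.overall
        if feedback.overall >= target:
            break
        current_prompt = feedback.revised_prompt
    return best_output, best_score


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_generate(p: str) -> str:
    return f"image<{p}>"


def toy_critique(original: str, output: str) -> Critique:
    ok = "detailed" in output
    scores = {"faithfulness": 1.0 if ok else 0.4, "quality": 1.0 if ok else 0.6}
    return Critique(scores, original + ", detailed")


result = generate_critique_refine("a cat", toy_generate, toy_critique)
```

With the toy stand-ins, the first round scores below the threshold, the critique's revised prompt is used for the second round, and the loop stops there. Only the prompt changes between rounds, which is what makes the procedure free of parameter updates.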

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
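The abstract names consistency filtering as a PARROT stage but does not say how consistency is checked. One plausible reading, assumed here purely for illustration, is that a rationale generated with the preference label as an anchor is kept only if a label-blind prediction made from that rationale recovers the same label. The `Candidate` fields and the `predict_from_rationale` callable are hypothetical.

```python
from typing import Callable, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    pair_id: str
    preferred: str   # human preference label used as the anchor: "A" or "B"
    rationale: str   # rationale generated while conditioning on that label


def consistency_filter(
    candidates: Iterable[Candidate],
    predict_from_rationale: Callable[[str], str],
) -> List[Candidate]:
    """Keep rationales whose label-blind prediction agrees with the anchor.

    Assumption: agreement is exact label match; the paper may use a
    different or softer criterion.
    """
    return [c for c in candidates if predict_from_rationale(c.rationale) == c.preferred]


# Toy predictor: reads the verdict stated in the rationale text (illustrative only).
def toy_predict(rationale: str) -> str:
    return "A" if "A is better" in rationale else "B"


pool = [
    Candidate("p1", "A", "A is better: sharper edges and correct object count."),
    Candidate("p2", "A", "B is better: colors match the prompt."),  # contradicts anchor
    Candidate("p3", "B", "B is better: the edit preserves the background."),
]
kept = consistency_filter(pool, toy_predict)
```

In this toy pool, `p2` is dropped because its rationale argues for the opposite image from the one humans preferred. The surviving pairs would then serve as distillation targets for the reward model.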

