CV · AI · Feb 11

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

arXiv:2602.11146v1 · 2 citations · h-index: 1
Originality: Highly original
AI Analysis

This paper addresses a computational bottleneck for researchers and practitioners working on AI alignment, particularly in image generation, by offering a more efficient alternative to VLM-based rewards. The contribution is incremental in that it builds on existing diffusion methods.

The paper tackles two problems in preference optimization for diffusion models: high computational cost and domain mismatch. It proposes DiNa-LRM, a diffusion-native latent reward model that operates directly on noisy diffusion states. The model achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost and improves preference optimization dynamics.

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
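
The abstract names two mechanisms without giving formulas: a Thurstone likelihood whose uncertainty depends on the diffusion noise level, and inference-time noise ensembling. The sketch below illustrates one plausible reading of both, and is not the paper's implementation. It assumes the classical Thurstone form P(a preferred over b) = Φ((r_a − r_b) / sqrt(σ_a² + σ_b²)), a hypothetical linear schedule `sigma_of_t`, a flow-matching style interpolation for noising, and a toy stand-in `toy_reward` for the timestep-conditioned reward head. All function names and parameters are illustrative.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF, used as the Thurstone (probit) link."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigma_of_t(t, sigma_min=0.1, sigma_max=1.0):
    """Hypothetical noise-dependent uncertainty: grows linearly with t in [0, 1].

    The actual calibration used by DiNa-LRM is not given in the abstract.
    """
    return sigma_min + (sigma_max - sigma_min) * t


def thurstone_nll(r_win, r_lose, t_win, t_lose):
    """Negative log-likelihood that the preferred sample wins.

    Each reward is treated as a Gaussian whose standard deviation depends on
    the noise level at which it was scored, so comparisons made on noisier
    states are trusted less.
    """
    scale = math.sqrt(sigma_of_t(t_win) ** 2 + sigma_of_t(t_lose) ** 2)
    p_win = normal_cdf((r_win - r_lose) / scale)
    return -math.log(max(p_win, 1e-12))  # clamp to avoid log(0)


def ensemble_reward(reward_fn, latent, timesteps, n_draws, rng):
    """Average the reward over several noise levels and noise draws.

    Assumes the interpolation x_t = (1 - t) * x_0 + t * eps, a common
    flow-matching convention; the paper may use a different schedule.
    """
    total, count = 0.0, 0
    for t in timesteps:
        for _ in range(n_draws):
            noisy = [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]
            total += reward_fn(noisy, t)
            count += 1
    return total / count


def toy_reward(noisy_latent, t):
    """Stand-in for a timestep-conditioned reward head (mean of the latent)."""
    return sum(noisy_latent) / len(noisy_latent)


if __name__ == "__main__":
    rng = random.Random(0)
    latent = [1.0, 3.0]
    print(thurstone_nll(1.0, 0.0, 0.2, 0.2))
    print(ensemble_reward(toy_reward, latent, [0.1, 0.3, 0.5], 8, rng))
```

Under these assumptions, the same reward margin yields a smaller loss at low noise than at high noise, and adding more timesteps or draws to `ensemble_reward` trades extra compute for a lower-variance score, which is the sense in which ensembling acts as test-time scaling.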

