CV · AI · LG · May 29, 2025

Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization

arXiv:2505.23331v2 · 5 citations · h-index: 7
AI Analysis

This work addresses alignment challenges for visual generative models, offering an efficient alternative to diffusion-based methods, though it appears incremental in applying existing RL techniques to a specific model type.

The paper tackles the problem of aligning visual autoregressive models with human preferences by fine-tuning them using Group Relative Policy Optimization, resulting in enhanced image quality and style control through RL-driven exploration beyond initial training distributions.

Fine-tuning pre-trained generative models with Reinforcement Learning (RL) has emerged as an effective approach for aligning outputs more closely with nuanced human preferences. In this paper, we investigate the application of Group Relative Policy Optimization (GRPO) to fine-tune next-scale visual autoregressive (VAR) models. Our empirical results demonstrate that this approach enables alignment to intricate reward signals derived from aesthetic predictors and CLIP embeddings, significantly enhancing image quality and enabling precise control over the generation style. Interestingly, by leveraging CLIP, our method can help VAR models generalize beyond their initial ImageNet distribution: through RL-driven exploration, these models can generate images aligned with prompts referencing image styles that were absent during pre-training. In summary, we show that RL-based fine-tuning is both efficient and effective for VAR models, benefiting particularly from their fast inference speeds, which are advantageous for online sampling, an aspect that poses significant challenges for diffusion-based alternatives.
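As a rough illustration of the "group relative" idea behind GRPO, the sketch below normalizes the rewards of several samples drawn for the same condition against that group's own mean and standard deviation, which removes the need for a learned value baseline. This is the commonly described GRPO advantage; the abstract does not say which variant the paper uses, and the function name, the `eps` constant, and the reward values are illustrative assumptions rather than details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per sample, normalized within the group.

    `rewards` holds scalar rewards for images sampled from the same
    condition (e.g. the same class label or prompt). `eps` is an assumed
    small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards (e.g. an aesthetic or CLIP-based score) for four samples.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group mean receive positive advantages and are reinforced, and those below are discouraged. Because every update requires a fresh group of samples from the current policy, fast inference directly lowers the cost of this online sampling step.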
