CV · LG · Jan 5

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

arXiv:2601.02256v1 · 2 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in RL-based optimization of VAR models for visual generation. It is an incremental advance aimed at researchers and practitioners who use these models.

The paper tackles asynchronous policy conflicts that arise when Visual Autoregressive (VAR) models are trained with reinforcement learning, and that cause unstable training and suboptimal alignment. It proposes a framework that augments Group Relative Policy Optimization (GRPO) with a stabilizing intermediate reward, dynamic time-step reweighting, and mask propagation, and it reports significant improvements in sample quality and alignment over the vanilla GRPO baseline.

Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
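As a rough illustration of the GRPO ingredient named in the abstract, the sketch below computes group-relative advantages (each sample's reward normalized against the other samples drawn for the same prompt) and then spreads each advantage across generation steps using per-step weights. The normalization is the commonly used GRPO form. The function names, the `step_weights` argument, and the weight values are assumptions made for illustration only; the abstract does not specify the paper's actual reweighting scheme, intermediate reward, or mask propagation, so none of those are reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def reweighted_step_advantages(rewards, step_weights):
    """Spread each sample's advantage over generation steps.

    `step_weights` is a hypothetical placeholder (one weight per step);
    the weights are normalized to sum to 1, so each row sums to the
    sample's group-relative advantage.
    """
    total = sum(step_weights)
    normalized = [w / total for w in step_weights]
    advantages = group_relative_advantages(rewards)
    return [[a * w for w in normalized] for a in advantages]


if __name__ == "__main__":
    rewards = [0.2, 0.5, 0.9, 0.4]   # one scalar reward per sampled image
    step_weights = [1.0, 2.0, 4.0]   # illustrative values only
    for row in reweighted_step_advantages(rewards, step_weights):
        print([round(x, 3) for x in row])
```

In this toy setup every step of a sample shares the sign of that sample's advantage, and only the magnitude varies by step. A dynamic scheme, as the abstract describes, would presumably compute the weights during training rather than fix them in advance.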

