CV · Oct 6, 2025

REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization

Qiyuan He, Yicong Li, Haotian Ye, Jinghao Wang, Xinyao Liao, Pheng-Ann Heng, Stefano Ermon, James Zou, Angela Yao

arXiv:2510.04450v1 · 5 citations · h-index: 12

Originality: Incremental advance

AI Analysis

This work targets a core bottleneck in visual autoregressive generation. For AI researchers, it offers an incremental improvement through a simple regularization technique.

The paper tackles the performance gap between visual autoregressive models and diffusion models by addressing generator-tokenizer inconsistency. It proposes a training strategy called reAR that improves image generation metrics: on ImageNet, gFID drops from 3.02 to 1.86 with a standard rasterization-based tokenizer, and with advanced tokenizers a 177M-parameter model reaches a gFID of 1.42, matching larger diffusion models (675M).

Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance still lags behind that of diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be well-decoded by the tokenizer. To address this, we propose reAR, a simple training strategy that introduces a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and to predict the embedding of the target token under a noisy context. reAR requires no changes to the tokenizer, generation order, or inference pipeline, and no external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
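
The token-wise regularization described in the abstract could look roughly like the NumPy sketch below. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss form, the heads, the noise scheme, or the weighting. Here the "noisy context" is assumed to be uniform random token replacement, the embedding targets are assumed to be the tokenizer's codebook vectors, the two regression heads are assumed to be plain linear maps with a mean-squared-error loss, and every name (`corrupt_tokens`, `regularized_loss`, `w_cur`, `w_next`, `reg_weight`) is hypothetical.

```python
import numpy as np


def corrupt_tokens(tokens, vocab_size, noise_prob, rng):
    """Assumed noise scheme: swap each context token for a uniformly
    random codebook index with probability `noise_prob`."""
    mask = rng.random(tokens.shape) < noise_prob
    random_tokens = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(mask, random_tokens, tokens)


def cross_entropy(logits, targets):
    """Mean next-token cross-entropy; logits is (T, V), targets is (T,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def regularized_loss(hidden, logits, cur_tokens, next_tokens,
                     codebook, w_cur, w_next, reg_weight=1.0):
    """Next-token loss plus two embedding-regression terms.

    hidden:      (T, D) transformer features computed from the noisy context
    logits:      (T, V) next-token logits
    cur_tokens:  (T,)   clean token index at each position
    next_tokens: (T,)   clean target token index at each position
    codebook:    (V, E) tokenizer embeddings (assumed regression targets)
    w_cur/w_next:(D, E) hypothetical linear heads
    """
    ce = cross_entropy(logits, next_tokens)
    # Recover the embedding of the current (clean) token.
    reg_cur = np.mean((hidden @ w_cur - codebook[cur_tokens]) ** 2)
    # Predict the embedding of the target token.
    reg_next = np.mean((hidden @ w_next - codebook[next_tokens]) ** 2)
    total = ce + reg_weight * (reg_cur + reg_next)
    return total, ce, reg_cur, reg_next


# Toy usage with random stand-ins for the transformer outputs.
rng = np.random.default_rng(0)
T, V, D, E = 8, 16, 32, 4
sequence = rng.integers(0, V, size=T + 1)
cur, nxt = sequence[:-1], sequence[1:]
noisy_context = corrupt_tokens(cur, V, noise_prob=0.25, rng=rng)
hidden = rng.normal(size=(T, D))   # would come from the model run on noisy_context
logits = rng.normal(size=(T, V))
codebook = rng.normal(size=(V, E))
w_cur = rng.normal(size=(D, E)) * 0.1
w_next = rng.normal(size=(D, E)) * 0.1
total, ce, reg_cur, reg_next = regularized_loss(
    hidden, logits, cur, nxt, codebook, w_cur, w_next, reg_weight=0.5)
```

Because both extra terms are computed only from training-time features, a scheme like this leaves the tokenizer, the generation order, and the inference pipeline untouched, which is consistent with the abstract's claim.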
