reAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization
This work addresses a core bottleneck in visual autoregressive generation, generator-tokenizer inconsistency, and offers AI researchers a simple regularization technique that improves generation quality without changing the tokenizer or the inference pipeline.
The paper tackles the performance gap between visual autoregressive models and diffusion models by addressing generator-tokenizer inconsistency. It proposes a training strategy called reAR that improves image generation metrics on ImageNet: gFID drops from 3.02 to 1.86 with a standard rasterization-based tokenizer, and with advanced tokenizers a 177M-parameter model reaches a gFID of 1.42, matching larger diffusion models (675M).
Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance remains suboptimal against diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be well-decoded by the tokenizer. To address this, we propose reAR, a simple training strategy introducing a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and predict the embedding of the target token under a noisy context. reAR requires no changes to the tokenizer, generation order, inference pipeline, or external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
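The token-wise objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the loss form, the noise scheme, or the weighting, so the mean-squared-error regularizers, the uniform token replacement in `corrupt_context`, the linear heads `w_cur` and `w_next`, and the `reg_weight` parameter are all assumptions made for illustration. The causal transformer itself is abstracted away; its hidden states and logits are passed in as arrays.

```python
import numpy as np

def corrupt_context(tokens, vocab_size, noise_rate, rng):
    """Replace a random fraction of token ids with uniformly drawn ids.
    Assumed noise scheme; the abstract only says 'noisy context'."""
    mask = rng.random(tokens.shape) < noise_rate
    random_ids = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(mask, random_ids, tokens)

def cross_entropy(logits, targets):
    """Mean cross-entropy; logits has shape (N, V), targets has shape (N,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def regularized_loss(logits, hidden, tokens, codebook, w_cur, w_next, reg_weight=1.0):
    """Next-token loss plus two embedding-regression terms (hypothetical form).

    logits:   (T-1, V) next-token logits from the causal transformer
    hidden:   (T-1, D) hidden states computed from the noisy context
    tokens:   (T,)     clean token ids from the tokenizer
    codebook: (V, E)   tokenizer embeddings, treated as fixed targets
    w_cur, w_next: (D, E) linear heads (assumed, not from the paper)
    """
    current, target = tokens[:-1], tokens[1:]
    ce = cross_entropy(logits, target)
    # Recover the visual embedding of the current (clean) token.
    recover = np.mean((hidden @ w_cur - codebook[current]) ** 2)
    # Predict the visual embedding of the target (next) token.
    predict = np.mean((hidden @ w_next - codebook[target]) ** 2)
    total = ce + reg_weight * (recover + predict)
    return total, ce, recover, predict

# Toy usage with random arrays standing in for a real transformer's outputs.
rng = np.random.default_rng(0)
T, V, D, E = 8, 16, 32, 4
tokens = rng.integers(0, V, size=T)
noisy_inputs = corrupt_context(tokens[:-1], V, noise_rate=0.25, rng=rng)
hidden = rng.normal(size=(T - 1, D))   # placeholder for transformer(noisy_inputs)
logits = rng.normal(size=(T - 1, V))   # placeholder for the output head
codebook = rng.normal(size=(V, E))
w_cur, w_next = rng.normal(size=(D, E)), rng.normal(size=(D, E))
total, ce, recover, predict = regularized_loss(
    logits, hidden, tokens, codebook, w_cur, w_next, reg_weight=0.5
)
print(float(total), float(ce), float(recover), float(predict))
```

In this sketch the regression targets come from the clean tokens even though the context is corrupted, which is one way to read "under a noisy context". Because the extra heads are used only in the loss, dropping them at sampling time leaves the inference pipeline unchanged, consistent with the claim above.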