CV · AI · Jul 11, 2025

Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

arXiv:2507.08441v2 · 18 citations · h-index: 12 · Has Code
Originality: Highly original
AI Analysis

This work addresses the challenge of efficient, high-quality image tokenization for autoregressive generation. It is incremental in the sense that it builds on existing vision foundation models, to which it adds novel enhancements.

The paper builds an image tokenizer for autoregressive image generation that uses a frozen vision foundation model as its encoder, and adds region-adaptive quantization and a semantic reconstruction objective. The resulting tokenizer improves image reconstruction and generation quality: it achieves a gFID of 1.36 on ImageNet benchmarks, converges three times faster, and enables high-fidelity class-conditional synthesis without classifier-free guidance.

In this work, we present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model, which is a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 1.36 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code is available at https://github.com/CVMI-Lab/VFMTok.
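The two ingredients named in the abstract, quantizing frozen-encoder features into discrete tokens and scoring the output against the foundation model's representations, can be illustrated with a toy sketch. This is not the VFMTok implementation. It assumes plain nearest-neighbour quantization rather than the paper's region-adaptive scheme, and it assumes a cosine-distance form for the semantic objective, which the abstract does not specify. All function names and the tiny 2-D feature vectors are hypothetical.

```python
import math

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def quantize(features, codebook):
    """Map each feature vector to a discrete token id.

    Assumption: ordinary nearest-neighbour lookup, NOT region-adaptive.
    """
    return [nearest_code(f, codebook) for f in features]

def cosine_similarity(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def semantic_reconstruction_loss(reconstructed, frozen_features):
    """Mean (1 - cosine similarity) between outputs and frozen features.

    Assumption: cosine distance stands in for the paper's actual objective.
    """
    losses = [
        1.0 - cosine_similarity(r, f)
        for r, f in zip(reconstructed, frozen_features)
    ]
    return sum(losses) / len(losses)

# Hypothetical stand-ins for features produced by a frozen encoder.
frozen_features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
codebook = [[1.0, 0.0], [0.0, 1.0]]

tokens = quantize(frozen_features, codebook)
# Here the "decoder" is just a codebook lookup, to keep the sketch tiny.
reconstructed = [codebook[t] for t in tokens]
loss = semantic_reconstruction_loss(reconstructed, frozen_features)
```

In this toy setup the loss is small because each feature sits close to a codebook entry. In a real tokenizer the reconstructed features would come from a learned decoder, and this term would be one of several training losses.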

