CV AIMar 10, 2025

V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation

Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, Biye Li

arXiv:2503.07493v15 citationsh-index: 5Has Code

Originality Incremental advance

AI Analysis

This work addresses the challenge of integrating visual data into existing LLMs for image generation, which is an incremental advancement in multimodal AI.

The paper tackles the problem of aligning visual tokenization with large language model vocabularies to enable autoregressive image generation, achieving high-fidelity reconstruction and outperforming mainstream VQ-based tokenizers in experiments.

We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLMs vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each represented as a soft categorical distribution over LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, employing a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, ensuring flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing. https://github.com/zhangguiwei610/V2Flow

View on arXiv PDF Code

Similar