CV · Feb 7, 2025

QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

arXiv:2502.05178v1 · 30 citations · h-index: 20
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of integrating visual and language modalities in AI systems. It offers a drop-in replacement for existing visual encoders and image tokenizers, though the contribution appears incremental because it builds on existing frameworks such as LLaVA and LlamaGen.

The paper tackles the problem of unifying multimodal understanding and generation by introducing QLIP, a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. With QLIP, a single model can perform both tasks with performance comparable to or better than existing methods.

We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
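To make the quantization step in the abstract more concrete, here is a minimal sketch of binary spherical quantization applied to a single latent vector, plus a toy way of balancing two loss terms. It is an illustration under stated assumptions, not the authors' implementation: the function names `bsq_quantize` and `balanced_loss`, the 4-dimensional example latent, and the inverse-reference weighting rule are all hypothetical choices made for this example.

```python
import math
from typing import List, Sequence, Tuple


def bsq_quantize(z: Sequence[float]) -> Tuple[List[float], int]:
    """Quantize one latent vector with binary spherical quantization (sketch).

    Steps: project z onto the unit sphere, take the sign of each coordinate,
    and scale the resulting +/-1 corner by 1/sqrt(d) so the code also lies
    on the unit sphere. The sign bits double as a discrete token id.
    """
    d = len(z)
    norm = math.sqrt(sum(v * v for v in z))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto the sphere")
    u = [v / norm for v in z]                 # point on the unit sphere
    bits = [1 if v > 0 else 0 for v in u]     # one bit per dimension
    scale = 1.0 / math.sqrt(d)
    code = [scale if b else -scale for b in bits]
    token = 0
    for b in bits:                            # pack bits, first dimension is the MSB
        token = (token << 1) | b
    return code, token


def balanced_loss(recon: float, align: float,
                  recon_ref: float, align_ref: float) -> float:
    """Hypothetical balancing rule: divide each term by a reference magnitude
    so that neither the reconstruction nor the alignment term dominates."""
    return recon / recon_ref + align / align_ref


code, token = bsq_quantize([0.5, -2.0, 1.5, -0.1])
print(token, code)
```

With a d-dimensional latent, the implicit codebook has 2^d entries without storing any codebook vectors, which is why the token id can be read directly from the sign bits.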

