CV, LG | Sep 6, 2024

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu

arXiv:2409.04429v344.3270 citations | h-index: 37 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for simpler, better-aligned models in visual AI, though it appears incremental because it builds on existing autoregressive methods.

The paper tackles the problem of misalignment and complexity in visual language models by introducing VILA-U, a unified foundation model that integrates video, image, language understanding and generation using a single autoregressive framework, achieving near state-of-the-art performance.

VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: first, the unified vision tower aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception; second, autoregressive image generation can achieve quality similar to that of diffusion models when trained on a high-quality dataset. Together, these allow VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
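The idea of one next-token objective covering both text and discrete visual tokens can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary sizes, the `IMG_START`/`IMG_END` marker tokens, the offset scheme for visual codes, and the `uniform_logits` stand-in model are all hypothetical, chosen only to show how a single sequence and a single loss can serve both modalities.

```python
import math

# All sizes and special tokens are illustrative assumptions, not values from the paper.
TEXT_VOCAB_SIZE = 1000       # hypothetical text vocabulary size
VISUAL_CODEBOOK_SIZE = 256   # hypothetical number of discrete visual codes
IMG_START = TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE  # hypothetical <image_start> id
IMG_END = IMG_START + 1                             # hypothetical <image_end> id
VOCAB_SIZE = IMG_END + 1


def visual_to_token_id(code):
    """Map a visual codebook index into the shared vocabulary, after the text ids."""
    if not 0 <= code < VISUAL_CODEBOOK_SIZE:
        raise ValueError("visual code out of range")
    return TEXT_VOCAB_SIZE + code


def build_sequence(text_ids, visual_codes):
    """Concatenate text tokens and visual tokens into one flat token sequence."""
    visual_ids = [visual_to_token_id(c) for c in visual_codes]
    return list(text_ids) + [IMG_START] + visual_ids + [IMG_END]


def next_token_pairs(sequence):
    """Return (context, target) pairs: each token is predicted from its prefix."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


def next_token_loss(sequence, logits_fn):
    """Average cross-entropy over all positions, text and visual alike."""
    pairs = next_token_pairs(sequence)
    total = 0.0
    for context, target in pairs:
        logits = logits_fn(context)
        m = max(logits)
        # Numerically stable log-sum-exp for the softmax normalizer.
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(pairs)


def uniform_logits(context):
    """Stand-in for a trained model: gives every token the same score."""
    return [0.0] * VOCAB_SIZE


seq = build_sequence([5, 42, 7], [3, 200, 17, 90])
loss = next_token_loss(seq, uniform_logits)
print(seq)
print(loss)
```

Because visual codes share one vocabulary with text, the loss function needs no modality-specific branch. In a real system `logits_fn` would be a trained transformer and the visual codes would come from a learned vision tokenizer, neither of which is modeled here.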
