CV · Nov 26, 2024

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

arXiv:2411.17762v4 · 38 citations · h-index: 1
Originality: Highly original
AI Analysis

The paper addresses the high training complexity and large data requirements of unified vision-language models, and represents a strong incremental advance.

The paper tackles the problem of aligning visual and language tokens in unified vision-language models. It proposes Semantic Discrete Encoding (SDE), which adds semantic constraints to the visual tokenizer, reducing training data requirements and improving performance. With the same LLM size, the method improves understanding performance by 4.8% over the previous SOTA Emu3 and surpasses the dedicated understanding model LLaVA-NeXT 34B by 3.7%. It also outperforms existing unified models on visual generation benchmarks.

We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes it difficult to align with language tokens. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, their performance is still far from dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces the amount of training data and improves the performance of the unified model. With the same LLM size, our method improved the understanding performance by 4.8% compared to the previous SOTA Emu3 and surpassed the dedicated understanding model LLaVA-NeXT 34B by 3.7%. Our model also surpasses the existing unified models on visual generation benchmarks.
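To make the idea of "adding semantic constraints to the visual tokenizer" concrete, here is a minimal NumPy sketch of one way such a constraint could look. It is not the paper's implementation. It assumes a VQ-style tokenizer that maps each patch feature to its nearest codebook entry, and it assumes the semantic constraint is a cosine-distance term pulling the quantized features toward features from some pretrained image-text encoder; the abstract does not say which teacher or loss is used. All names (`quantize`, `tokenizer_loss`, `decoder_w`, `semantic_w`, `teacher`, `lam`) are hypothetical, the linear "decoders" stand in for real networks, and the usual codebook and commitment terms are omitted for brevity.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (VQ lookup)."""
    # Squared Euclidean distance between every feature and every code: shape (N, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)      # discrete visual token ids, shape (N,)
    return ids, codebook[ids]       # ids and their quantized vectors, shape (N, D)

def cosine_distance(a, b, eps=1e-8):
    """Row-wise 1 - cosine similarity, a value in [0, 2]."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - (a * b).sum(axis=-1)

def tokenizer_loss(features, codebook, decoder_w, semantic_w, pixels, teacher, lam=1.0):
    """Reconstruction loss plus an assumed semantic-alignment term."""
    ids, quantized = quantize(features, codebook)
    # Low-level objective: reconstruct pixel targets from the quantized codes.
    recon_loss = ((quantized @ decoder_w - pixels) ** 2).mean()
    # Assumed semantic constraint: align the codes with teacher features.
    sem_loss = cosine_distance(quantized @ semantic_w, teacher).mean()
    return ids, recon_loss + lam * sem_loss, recon_loss, sem_loss

# Toy data: 16 patches, 4-dim features, 8 codes, 12-dim pixels, 6-dim teacher features.
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 4))
codebook = rng.normal(size=(8, 4))
decoder_w = rng.normal(size=(4, 12))
semantic_w = rng.normal(size=(4, 6))
pixels = rng.normal(size=(16, 12))
teacher = rng.normal(size=(16, 6))

ids, total, recon_loss, sem_loss = tokenizer_loss(
    features, codebook, decoder_w, semantic_w, pixels, teacher, lam=0.5
)
print("token ids:", ids)
print("total loss:", total)
```

In this sketch the weight `lam` trades off pixel fidelity against semantic alignment. With `lam=0` the objective reduces to a purely low-level reconstruction loss, which is the setting the abstract criticizes in tokenizers such as VQGAN.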
