CV | AI | CL | Oct 17, 2024

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

arXiv:2410.13848v1 | 405 citations | h-index: 15 | CVPR
Originality: Incremental advance
AI Analysis

Janus addresses a bottleneck in unified multimodal models, giving AI researchers and practitioners a more flexible and effective approach. The advance is incremental because it builds on existing unified architectures.

The paper tackles the suboptimal multimodal understanding performance that results from using a single visual encoder for both understanding and generation. It introduces Janus, an autoregressive framework that decouples visual encoding into separate pathways. The resulting model surpasses previous unified models and matches or exceeds task-specific models.

In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research, such as Chameleon, often relies on a single visual encoder for both tasks. However, because multimodal understanding and generation require different levels of information granularity, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, the multimodal understanding and generation components can each independently select their most suitable encoding method. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
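The decoupling described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: the names `understanding_encoder`, `generation_encoder`, `shared_transformer`, `DecoupledModel`, and the width `DIM` are all hypothetical, and the "encoders" and "transformer" are trivial stand-ins chosen only to show the routing pattern, in which each task picks its own visual pathway while both feed one shared sequence model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]
DIM = 4  # hypothetical shared embedding width


def understanding_encoder(image: List[float]) -> List[Vector]:
    """Toy stand-in for a coarse encoder: one summary vector per image."""
    mean = sum(image) / len(image)
    return [[mean] * DIM]


def generation_encoder(image: List[float]) -> List[Vector]:
    """Toy stand-in for a fine-grained encoder: one vector per pixel value."""
    return [[pixel] * DIM for pixel in image]


def shared_transformer(sequence: List[Vector]) -> Vector:
    """Toy stand-in for the single unified transformer: mean-pools the sequence."""
    n = len(sequence)
    return [sum(vec[i] for vec in sequence) / n for i in range(DIM)]


@dataclass
class DecoupledModel:
    # Maps a task name to its own visual encoding pathway (assumed structure).
    encoders: Dict[str, Callable[[List[float]], List[Vector]]]

    def forward(self, task: str, image: List[float],
                text_embeddings: List[Vector]) -> Vector:
        # Only the visual pathway depends on the task;
        # the downstream sequence model is shared by both tasks.
        visual = self.encoders[task](image)
        return shared_transformer(text_embeddings + visual)


model = DecoupledModel(encoders={
    "understanding": understanding_encoder,
    "generation": generation_encoder,
})
text = [[1.0] * DIM]
understanding_out = model.forward("understanding", [0.0, 1.0], text)
generation_out = model.forward("generation", [0.0, 1.0], text)
```

Because the encoders are held in a plain mapping, either pathway can be swapped for a different encoding method without touching the other pathway or the shared model, which is the flexibility the abstract points to.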

Code Implementations: 1 repo