CV · Jun 4, 2025

Resolving Task Objective Conflicts in Unified Model via Task-Aware Mixture-of-Experts

arXiv:2506.03591v2
Originality: Highly original
AI Analysis

This paper addresses a fundamental bottleneck in unified multimodal models, task interference between understanding and generation, and offers a solution that improves performance on vision-language tasks.

The paper tackles task objective conflicts in unified multimodal large language models, which force suboptimal trade-offs between understanding and generation tasks. It proposes a framework that resolves these conflicts and reports state-of-the-art performance across various multimodal benchmarks.

Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts because of the inherent AR architecture. In this paper, we propose a novel approach that decouples internal components of the AR transformer to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization subpaths. To enhance task differentiation while maintaining overall coordination, we introduce a novel Two-Stage Training Strategy. Extensive experiments on multimodal benchmarks demonstrate that UTAMoE mitigates task objective conflicts, achieving state-of-the-art performance across various tasks. Visualizations and ablation studies further validate the effectiveness of our approach.
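
The abstract does not spell out how the Task-Aware MoE Layer routes tokens, so the following is only a minimal sketch of the general idea of task-specific subpaths, not the paper's implementation. It assumes hard routing by a task label to one feed-forward expert per task, with a residual connection; the class name `TaskAwareMoELayer`, the helper functions, the task names, and all dimensions are hypothetical and chosen for illustration.

```python
import random


def make_linear(n_in, n_out, rng):
    """Return a random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


class TaskAwareMoELayer:
    """Hypothetical layer: one feed-forward expert per task (assumption)."""

    def __init__(self, dim, hidden, tasks=("understanding", "generation"), seed=0):
        rng = random.Random(seed)
        # Each task owns its own pair of weight matrices, so updates to one
        # expert would not touch the parameters of the other.
        self.experts = {
            task: (make_linear(dim, hidden, rng), make_linear(hidden, dim, rng))
            for task in tasks
        }

    def forward(self, x, task):
        # Hard routing by task label (an assumption; the paper may route differently).
        w_in, w_out = self.experts[task]
        h = matvec(w_out, relu(matvec(w_in, x)))
        # Residual connection keeps the shared token representation in the output.
        return [xi + hi for xi, hi in zip(x, h)]


layer = TaskAwareMoELayer(dim=4, hidden=8)
token = [0.5, -1.0, 0.25, 2.0]
out_und = layer.forward(token, "understanding")
out_gen = layer.forward(token, "generation")
```

In this sketch the same token passes through different parameters depending on the task, which is one simple way to give understanding and generation separate optimization subpaths inside a shared model.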
