MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
This work targets task interference in generalist multimodal large language models (MLLMs) and offers researchers and practitioners in multimodal AI a method for improving them. The contribution appears incremental, since it builds on existing mixture-of-experts techniques.
The paper tackles task interference in generalist multimodal large language models (MLLMs), which causes them to underperform specialist models. It proposes a mixture of multimodal experts (MoME) that combines vision experts and language experts to adapt to task discrepancies, and it reports significant performance improvements across various vision-language tasks.
Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language (VL) tasks. However, a generalist MLLM typically underperforms a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. Our MoME is composed of two key components: a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE adaptively modulates the features transformed from various vision encoders and is compatible with a range of transformation architectures. MoLE incorporates sparsely gated experts into LLMs to achieve painless improvements with roughly unchanged inference costs. In response to task interference, our MoME specializes in both the vision and language modalities to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks. The source code is released at https://github.com/JiuTian-VL/MoME.
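To make the idea of sparsely gated experts more concrete, here is a minimal sketch of top-1 routing in plain Python. It is not the authors' implementation: the function names (`softmax`, `matvec`, `sparse_moe_top1`), the use of linear experts, the top-1 selection rule, and the residual-style combination are all assumptions chosen for illustration. The point it shows is why inference cost can stay roughly unchanged: the gate scores every expert, but only the selected expert is evaluated for a given token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sparse_moe_top1(token, gate_weights, experts):
    """Route one token to its single highest-scoring expert.

    token:        feature vector for one token
    gate_weights: one row per expert (hypothetical linear gate)
    experts:      one square weight matrix per expert (hypothetical linear experts)
    """
    probs = softmax(matvec(gate_weights, token))
    chosen = max(range(len(probs)), key=probs.__getitem__)
    # Only the chosen expert is evaluated; the others are skipped entirely.
    expert_out = matvec(experts[chosen], token)
    # Assumed residual-style combination, weighted by the gate probability.
    output = [x + probs[chosen] * y for x, y in zip(token, expert_out)]
    return output, chosen, probs


# Usage: two experts over 2-dimensional tokens.
gate = [[2.0, 0.0], [0.0, 2.0]]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],    # expert 0: identity
    [[-1.0, 0.0], [0.0, -1.0]],  # expert 1: negation
]
output, chosen, probs = sparse_moe_top1([1.0, 0.0], gate, experts)
print(chosen, output)  # expert index 0 is selected for this input
```

In this sketch the per-token compute is one gate evaluation plus one expert, regardless of how many experts exist. Practical systems typically add details omitted here, such as batching, top-k routing, and load-balancing terms during training.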