CV · Nov 16, 2024

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

arXiv:2411.10669v1 · 1 citation · h-index: 13 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses performance degradation in MLLMs that are trained on diverse visual and textual tasks simultaneously. It is an incremental improvement in parameter-efficient scaling.

The paper tackles the multi-task conflict issue in Multimodal Large Language Models (MLLMs) by proposing Awaker2.5-VL, a Mixture of Experts architecture whose experts are low-rank adaptation (LoRA) modules. The authors report that experiments on multiple recent benchmarks demonstrate its effectiveness.

As research on Multimodal Large Language Models (MLLMs) grows in popularity, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across various tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLMs, which acquires multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is designed as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple recent benchmarks demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are released on our Project Page: https://github.com/MetabrainAGI/Awaker.
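The idea of sparsely activated LoRA experts can be illustrated with a minimal NumPy sketch. This is not the Awaker2.5-VL implementation: the abstract does not describe the routing scheme, so the token-level softmax top-k router, the zero initialization of the `B` matrices, and all names here (`LoRAMoELinear`, `W_gate`, `top_k`, `rank`) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class LoRAMoELinear:
    """Frozen linear layer plus sparsely activated low-rank (LoRA) experts.

    Hypothetical illustration; routing and initialization are assumptions.
    """

    def __init__(self, d_in, d_out, num_experts=4, rank=8, top_k=1, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Frozen base weight (stands in for a pretrained projection).
        self.W = rng.normal(0.0, 0.02, size=(d_in, d_out))
        # Router weights: one logit per expert for each token.
        self.W_gate = rng.normal(0.0, 0.02, size=(d_in, num_experts))
        # Each expert is a low-rank pair A (d_in x rank), B (rank x d_out).
        self.A = rng.normal(0.0, 0.02, size=(num_experts, d_in, rank))
        # B starts at zero so the layer initially equals the base layer.
        self.B = np.zeros((num_experts, rank, d_out))

    def forward(self, x):
        """x: (n_tokens, d_in) -> (output, chosen expert indices)."""
        gate_probs = softmax(x @ self.W_gate)                      # (n, E)
        top_idx = np.argsort(-gate_probs, axis=-1)[:, :self.top_k]  # (n, k)
        top_p = np.take_along_axis(gate_probs, top_idx, axis=-1)
        top_p = top_p / top_p.sum(axis=-1, keepdims=True)          # renormalize
        out = x @ self.W
        for e in range(self.A.shape[0]):
            # Tokens that routed to expert e, and the top-k slot they used.
            token_ids, slot = np.nonzero(top_idx == e)
            if token_ids.size == 0:
                continue  # expert e is inactive for this batch
            delta = (x[token_ids] @ self.A[e]) @ self.B[e]
            out[token_ids] += top_p[token_ids, slot][:, None] * delta
        return out, top_idx
```

Only the `top_k` selected experts run for each token, and each expert adds `rank * (d_in + d_out)` parameters instead of a full `d_in * d_out` matrix, which is the general motivation for combining MoE routing with LoRA.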

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
