LG | AI | Feb 2

SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

Zhen-Hao Xie, Jun-Tao Tang, Yu-Cheng Shi, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou

arXiv:2602.01990v1 | 5 citations | h-index: 35

Originality: Highly original

AI Analysis

This work addresses the challenge of keeping multimodal large language models consistent on earlier tasks as they continually learn new ones, which is essential for real-world deployment.

The paper tackles router drift and expert overwriting in multimodal continual instruction tuning: as the model learns new tasks, expert selection becomes inconsistent and shared experts degrade. The proposed SAME method stabilizes routing dynamics and regulates expert updates, and the authors report state-of-the-art performance.

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding-related experts can be overwritten by new tasks and lose their original functionality. Such failure reflects two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture-of-Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. To mitigate expert drift, we regulate expert updates via curvature-aware scaling using historical input covariance in a rehearsal-free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference. Extensive experiments demonstrate its SOTA performance.
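
The two stabilization ideas in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch swaps in two generic techniques that match the description only loosely. Router gradients are projected away from the dominant subspace of past-task inputs, and expert gradients are scaled by the inverse of a damped historical input covariance. All function names, the `energy` threshold, and the `damping` constant are hypothetical.

```python
import numpy as np

def protected_basis(past_inputs, energy=0.95):
    """Orthonormal basis (d, k) for the dominant directions of past router inputs.

    Assumption: the subspace to protect is found by SVD of stored input
    features, keeping enough directions to cover `energy` of the variance.
    """
    _, s, vt = np.linalg.svd(past_inputs, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(ratio, energy)) + 1
    return vt[:k].T

def project_router_grad(grad, basis):
    """Drop the part of a router gradient (n_experts, d) inside the protected subspace.

    After projection the update leaves router logits unchanged for any input
    lying in span(basis), so old routing decisions on such inputs are preserved.
    """
    return grad - grad @ basis @ basis.T

def covariance_scaled_update(grad, cov, damping=1e-3):
    """Scale an expert gradient (out, d) by the inverse damped input covariance.

    Assumption: input directions that were strongly used by earlier tasks
    (large covariance) should receive proportionally smaller updates.
    """
    d = cov.shape[0]
    # Solve (cov + damping * I) X = grad^T, i.e. X^T = grad @ inv(cov + damping * I).
    return np.linalg.solve(cov + damping * np.eye(d), grad.T).T
```

In this sketch only the covariance matrix and the basis are kept between tasks, not the raw samples, which is one way to stay rehearsal-free; how SAME itself stores and uses these statistics is described in the paper, not here.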
