CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

arXiv:2605.2024775.7

Predicted impact: top 19% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners of continual learning in large language and vision-language models, CP-MoE addresses the trade-off between knowledge transfer and forgetting in MoE architectures.

CP-MoE introduces a consistency-preserving mixture-of-experts framework for continual learning in which a transient expert guides updates to stable experts. It achieves state-of-the-art performance on the SuperNI benchmark, with stronger zero-shot transfer, and reduces forgetting on VQA v2.

Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision-language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces two components: a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steers routing towards more compatible experts, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On the SuperNI benchmark, which spans diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On the VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.
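
The abstract does not give formulas for either component, so the following is only an illustrative sketch of the two ideas and not the authors' method. It assumes that representation similarity is measured by cosine similarity between the transient expert's output and each stable expert's output, that the bias is added to the router logits before top-k selection, and that "protecting important parameters" means shrinking the merged update where a per-parameter importance score is high. The names `biased_routing`, `guided_merge`, `bias_scale`, `importance`, and `strength` are all hypothetical.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors (eps avoids division by zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def biased_routing(router_logits, transient_out, stable_outs, bias_scale=1.0, top_k=2):
    """Add a similarity-based bias to the router logits, then pick top-k experts.

    Assumption: the bias is bias_scale * cos(transient output, stable expert output).
    Returns the selected expert indices, their softmax weights, and the similarities.
    """
    sims = np.array([cosine_similarity(transient_out, s) for s in stable_outs])
    biased = router_logits + bias_scale * sims
    top = np.argsort(-biased)[:top_k]          # indices of the k largest biased logits
    z = biased[top] - biased[top].max()        # shift for a numerically stable softmax
    weights = np.exp(z) / np.exp(z).sum()
    return top, weights, sims


def guided_merge(stable_w, transient_delta, importance, strength=1.0):
    """Merge a transient update into stable weights, damping important parameters.

    Assumption: the gate 1 / (1 + strength * importance) scales each update entry,
    so entries with high historical importance change less.
    """
    gate = 1.0 / (1.0 + strength * importance)
    return stable_w + gate * transient_delta


# Toy example: three stable experts with 2-D outputs and an uninformative router.
router_logits = np.zeros(3)
transient_out = np.array([1.0, 0.0])
stable_outs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
top, weights, sims = biased_routing(router_logits, transient_out, stable_outs)

# Toy merge: the second parameter is marked as important, so it moves less.
merged = guided_merge(
    stable_w=np.array([1.0, 1.0]),
    transient_delta=np.array([0.5, 0.5]),
    importance=np.array([0.0, 9.0]),
)
```

In this toy setup the router alone has no preference, so the expert whose output is most aligned with the transient expert receives the largest routing weight, and the parameter marked as important absorbs only a tenth of its update. A real system would compute the similarity and importance scores from model activations and training statistics, which this sketch does not attempt.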
