LG · AI · CL · CV · May 18

CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

arXiv:2605.2024775.7
Predicted impact top 19% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners of continual learning in large language and vision-language models, CP-MoE addresses the trade-off between knowledge transfer and forgetting in MoE architectures.

CP-MoE introduces a consistency-preserving mixture-of-experts framework for continual learning that uses a transient expert to guide stable expert updates, achieving state-of-the-art performance on the SuperNI benchmark with stronger zero-shot transfer and reducing forgetting on VQA v2.

Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision-language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces two components: a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steer routing towards more compatible experts, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On the SuperNI benchmark, which spans diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On the VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.
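The abstract names the two mechanisms but gives no formulas, so the following is only a toy sketch of how such components might look, not the paper's method. Everything in it is an assumption made for illustration: cosine similarity as the similarity measure, an additive logit bias scaled by a hypothetical `beta`, per-parameter `importance` scores, and a hypothetical `lam` that controls how strongly important parameters are shielded during merging.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def biased_routing(router_logits, transient_out, stable_outs, beta=1.0):
    """Assumed form of a routing bias: raise the logit of each stable
    expert in proportion to how similar its representation is to the
    transient expert's representation, then normalise."""
    bias = [cosine(transient_out, s) for s in stable_outs]
    return softmax([l + beta * b for l, b in zip(router_logits, bias)])

def guarded_merge(stable_w, transient_delta, importance, lam=1.0):
    """Assumed form of regularised merging: shrink the transient update
    on parameters whose historical importance score is high."""
    return [w + d / (1.0 + lam * imp)
            for w, d, imp in zip(stable_w, transient_delta, importance)]

# Toy inputs: two stable experts, the first aligned with the transient expert.
probs = biased_routing([0.0, 0.0], [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# Second parameter has high importance, so its update is damped.
merged = guarded_merge([1.0, 1.0], [0.5, 0.5], [0.0, 4.0])
```

In this toy setting the router favours the stable expert whose representation matches the transient expert, and the merge applies the full update to the unimportant parameter while applying only a fifth of it to the important one.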

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
