LGMay 17
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR EraQiuhe Hong, Yuyang Liu, Shuo Yang et al.
Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
CVAug 6, 2025Code
Continual Learning for VLMs: A Survey and Taxonomy Beyond ForgettingYuyang Liu, Qiuhe Hong, Linlan Huang et al.
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) \textit{Multi-Modal Replay Strategies} address cross-modal drift through explicit or implicit memory mechanisms; (2) \textit{Cross-Modal Regularization} preserves modality alignment during updates; and (3) \textit{Parameter-Efficient Adaptation} mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.
LGJul 26, 2025
GNSP: Gradient Null Space Projection for Preserving Cross-Modal Alignment in VLMs Continual LearningTiantian Peng, Yuyang Liu, Shuo Yang et al.
Contrastive Language-Image Pretraining has demonstrated remarkable zero-shot generalization by aligning visual and textual modalities in a shared embedding space. However, when continuously fine-tuned on diverse tasks, CLIP suffers from catastrophic forgetting and degradation of its embedding alignment, undermining its zero-shot capabilities. In this work, we propose Gradient Null Space Projection (GNSP), an efficient continual learning method that projects task-specific gradients onto the null space of previously learned knowledge. This orthogonal projection mathematically prevents interference with previous tasks without relying on rehearsal or architectural modification. Furthermore, to preserve the inherent generalization property of CLIP, we introduce knowledge distillation and combine it with a modality alignment preservation loss inspired by CLIP pre-training to stabilize the structure of the multimodal embedding space during fine-tuning. On the MTIL benchmark consisting of 11 tasks, our method achieved SOTA performance on both the Average and Last key metrics. More importantly, experiments show that our method successfully maintains the original modality gap and cross-modal retrieval performance of CLIP, confirming its effectiveness in maintaining a robust visual-language space throughout the continual learning process.