LG · CV · May 17

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

arXiv:2605.18903 · 89.3
Predicted impact: top 9% in LG · last 90 days
Originality: Highly original
AI Analysis

For researchers working on continual learning for MLLMs, this work provides a reasoning-level signal that stays reliable on out-of-distribution samples where answer-level signals do not, enabling more effective adaptation to new tasks while mitigating forgetting.

The paper tackles continual learning for Multimodal Large Language Models (MLLMs) in the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm. It introduces Reasoning Portability (RP), a sample-level measure of how reusable the previous policy's behavior is on a new task, and proposes RDB-CL, which adjusts the per-sample KL regularization according to RP, improving Last accuracy by +12.0% over the vanilla RLVR baseline.

Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
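
The per-sample modulation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual mapping from RP to regularization strength, so everything here is an assumption made for illustration: RP is taken to be a score in [0, 1], the KL coefficient is interpolated linearly between two hypothetical bounds `beta_min` and `beta_max`, and the objective is a plain mean of reward minus weighted KL. None of these names or values come from the paper.

```python
from typing import Sequence


def per_sample_kl_coeff(rp: float, beta_min: float = 0.01, beta_max: float = 0.1) -> float:
    """Map a portability score to a KL coefficient.

    Hypothetical rule (not from the paper): linear interpolation, so a
    high-RP sample gets a tight anchor (beta near beta_max) and a low-RP
    sample gets a relaxed anchor (beta near beta_min).
    """
    rp = min(max(rp, 0.0), 1.0)  # clamp to [0, 1], an assumed range for RP
    return beta_min + rp * (beta_max - beta_min)


def regularized_objective(
    rewards: Sequence[float],
    kls: Sequence[float],
    rps: Sequence[float],
    beta_min: float = 0.01,
    beta_max: float = 0.1,
) -> float:
    """Batch mean of reward_i - beta_i * KL_i, with beta_i set by RP_i."""
    if not (len(rewards) == len(kls) == len(rps)):
        raise ValueError("rewards, kls and rps must have the same length")
    if len(rewards) == 0:
        raise ValueError("batch must not be empty")
    total = 0.0
    for reward, kl, rp in zip(rewards, kls, rps):
        beta = per_sample_kl_coeff(rp, beta_min, beta_max)
        total += reward - beta * kl
    return total / len(rewards)
```

Under this assumed rule, two samples with the same reward and the same divergence from the previous policy are penalized differently: the high-RP sample pays roughly `beta_max` per unit of KL, while the low-RP sample pays only `beta_min`, leaving it freer to move away from the old policy.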

