LG · CV · Jan 30

Decomposing and Composing: Towards Efficient Vision-Language Continual Learning via Rank-1 Expert Pool in a Single LoRA

arXiv:2601.22828v1 · 1 citation · h-index: 28
Originality: Highly original
AI Analysis

This work addresses the problem of efficient and effective continual learning for vision-language models, offering a computationally lightweight solution that avoids catastrophic forgetting without heavy inference burdens or external dependencies.

The paper tackles the challenge of catastrophic forgetting in vision-language continual learning by introducing a framework that restructures a single LoRA module into a decomposable Rank-1 Expert Pool, enabling dynamic, sparse task-specific updates. This approach achieves state-of-the-art results across multiple settings, reduces trainable parameters by 96.7% compared to the baseline method, and eliminates reliance on external datasets or task-ID discriminators.

Continual learning (CL) in vision-language models (VLMs) faces significant challenges in improving task adaptation while avoiding catastrophic forgetting. Existing methods usually impose a heavy inference burden or rely on external knowledge, whereas Low-Rank Adaptation (LoRA) has shown potential to reduce these issues by enabling parameter-efficient tuning. However, because directly using LoRA to alleviate catastrophic forgetting is non-trivial, we introduce a novel framework that restructures a single LoRA module as a decomposable Rank-1 Expert Pool. Our method learns to dynamically compose a sparse, task-specific update by selecting from this expert pool, guided by the semantics of the [CLS] token. In addition, we propose an Activation-Guided Orthogonal (AGO) loss that orthogonalizes critical parts of the LoRA weights across tasks. This sparse composition and orthogonalization enable fewer parameter updates, resulting in domain-aware learning while minimizing inter-task interference and maintaining downstream task performance. Extensive experiments across multiple settings demonstrate state-of-the-art results in all metrics, surpassing zero-shot upper bounds in generalization. Notably, our method reduces trainable parameters by 96.7% compared to the baseline method and eliminates reliance on external datasets or task-ID discriminators. The merged LoRAs retain fewer weights and incur no inference latency, making our method computationally lightweight.
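
To make the idea of a "decomposable Rank-1 Expert Pool" concrete, the sketch below shows how a rank-r LoRA update B @ A can be viewed as a sum of r rank-1 matrices, and how a sparse subset of them could be selected from a [CLS] embedding. This is an illustration, not the paper's implementation: the linear router, the top-k selection rule, the function names, and all dimensions are assumptions made here, and the AGO loss is not covered.

```python
import numpy as np

def rank1_experts(B, A):
    """Split a LoRA update B @ A into r rank-1 experts b_i a_i^T.

    B has shape (d_out, r) and A has shape (r, d_in).
    """
    return [np.outer(B[:, i], A[i, :]) for i in range(B.shape[1])]

def select_experts(cls_embedding, router_weights, k):
    """Pick the k experts with the highest router score.

    Hypothetical linear router: one score per expert from the [CLS] embedding.
    """
    scores = router_weights @ cls_embedding      # shape (r,)
    return np.sort(np.argsort(scores)[-k:])      # indices of the top-k experts

def compose_update(B, A, active):
    """Build the sparse update that uses only the selected rank-1 experts."""
    mask = np.zeros(B.shape[1])
    mask[active] = 1.0
    return (B * mask) @ A                        # zeroes out unselected columns of B

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d_out, d_in, d_cls, r, k = 6, 5, 8, 4, 2
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
router_weights = rng.normal(size=(r, d_cls))
cls_embedding = rng.normal(size=d_cls)

experts = rank1_experts(B, A)
active = select_experts(cls_embedding, router_weights, k)
delta_w = compose_update(B, A, active)
```

Because the composed update is an ordinary matrix of the same shape as the frozen weight, it can be added to that weight once selection is done, which is consistent with the abstract's claim that the merged LoRAs incur no inference latency.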
