CV, Nov 14, 2025

Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning

arXiv:2511.10974v1 | 1 citation | h-index: 19
Originality: Incremental advance
AI Analysis

This paper addresses continual learning for vision-language models, which matters for real-world AI systems that must adapt over time. The contribution appears incremental, as it builds on existing CLIP-based CIL approaches.

The paper tackles class-incremental learning with CLIP models, where learning new classes causes classifier bias and distributional drift. It proposes DMC, which decouples vision and text adaptation, and DMC-OT, which adds optimal-transport calibration of stored memory statistics. Both are reported to achieve state-of-the-art performance on CIFAR-100, ImageNet-R, CUB-200, and UCF-101, with DMC-OT further improving accuracy by an average of 1.80%.

Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge. While recent advances in vision-language models such as CLIP have demonstrated strong generalization across domains, extending them to continual settings remains challenging. In particular, learning task-specific soft prompts for newly introduced classes often leads to severe classifier bias, as the text prototypes overfit to recent categories when prior data are unavailable. In this paper, we propose DMC, a simple yet effective two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder and the optimization of textual soft prompts. Each stage is trained with the other frozen, allowing one modality to act as a stable semantic anchor for the other to preserve cross-modal alignment. Furthermore, current CLIP-based CIL approaches typically store class-wise Gaussian statistics for generative replay, yet they overlook the distributional drift that arises when the vision encoder is updated over time. To address this issue, we introduce DMC-OT, an enhanced version of DMC that incorporates an optimal-transport guided calibration strategy to align memory statistics across evolving encoders, along with a task-specific prompting design that enhances inter-task separability. Extensive experiments on CIFAR-100, ImageNet-R, CUB-200, and UCF-101 demonstrate that both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT further improving accuracy by an average of 1.80%.
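
To make the idea of storing class-wise Gaussian statistics and calibrating them under encoder drift more concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the abstract does not say how the optimal-transport map is estimated, so this sketch assumes diagonal covariances, for which the optimal-transport map between two Gaussians has a simple closed form. All names (`DiagGaussian`, `fit_diag_gaussian`, `sample_replay`, `ot_map_diag`, `calibrate`) are hypothetical, as is the choice to estimate drift from the same samples embedded by the old and the new encoder.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class DiagGaussian:
    """Per-class feature statistics: mean and per-dimension variance (assumed diagonal)."""
    mean: list[float]
    var: list[float]


def fit_diag_gaussian(features):
    """Estimate the per-dimension mean and population variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return DiagGaussian(mean, var)


def sample_replay(stats, n, rng):
    """Draw n pseudo-features from the stored Gaussian (generative replay)."""
    return [
        [rng.gauss(m, math.sqrt(v)) for m, v in zip(stats.mean, stats.var)]
        for _ in range(n)
    ]


def ot_map_diag(src, dst, eps=1e-12):
    """Closed-form OT map between diagonal Gaussians:
    x -> mean_dst + sqrt(var_dst / var_src) * (x - mean_src), per dimension."""
    scale = [math.sqrt(vd / (vs + eps)) for vs, vd in zip(src.var, dst.var)]

    def transport(x):
        return [md + s * (xi - ms)
                for xi, ms, md, s in zip(x, src.mean, dst.mean, scale)]

    return transport, scale


def calibrate(old_stats, src, dst):
    """Push stored statistics through the affine OT map (assumption: drift is
    estimated from src/dst, the same data seen by the old and new encoder)."""
    transport, scale = ot_map_diag(src, dst)
    new_mean = transport(old_stats.mean)
    new_var = [s * s * v for s, v in zip(scale, old_stats.var)]
    return DiagGaussian(new_mean, new_var)


# Toy usage: features of the current task under the old and new encoder.
old_encoder_feats = fit_diag_gaussian([[0.0, 0.0], [2.0, 4.0]])
new_encoder_feats = DiagGaussian(mean=[1.0, -1.0], var=[4.0, 1.0])
memory = DiagGaussian(mean=[2.0, 2.0], var=[1.0, 1.0])  # stored stats of an old class
calibrated = calibrate(memory, old_encoder_feats, new_encoder_feats)
replay = sample_replay(calibrated, n=5, rng=random.Random(0))
```

A full-covariance version would replace the per-dimension scale with the matrix form of the Gaussian optimal-transport map; the diagonal case is used here only to keep the sketch dependency-free.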

