CVMar 24, 2025

CalFuse: Multi-Modal Continual Learning via Feature Calibration and Parameter Fusion

arXiv:2503.18672v8h-index: 6
Originality Incremental advance
AI Analysis

This addresses incremental learning challenges for real-world big data applications using vision-language models, though it appears incremental as it builds on existing CLIP-based approaches.

The paper tackles catastrophic forgetting in multi-modal continual learning by proposing CalFuse, which combines feature calibration and parameter fusion to preserve cross-modal generalization while adapting to new classes, achieving superior performance over state-of-the-art methods on benchmark datasets.

With the proliferation of multi-modal data in large-scale visual recognition systems, enabling models to continuously acquire knowledge from evolving data streams while preserving prior information has become increasingly critical. Class-Continual Learning (CCL) addresses this challenge by incrementally incorporating new class knowledge without revisiting historical data, making it essential for real-world big data applications. While traditional CCL methods rely solely on visual features, recent advances in Vision-Language Models (VLMs) such as CLIP demonstrate significant potential for CCL by leveraging pre-trained multi-modal knowledge. However, existing approaches face challenges in mitigating catastrophic forgetting while maintaining the cross-modal generalization capabilities of VLMs. To address these limitations, we propose CalFuse, a framework that synergizes feature Calibration with parameter Fusion to enable effective multi-modal knowledge integration in continual learning scenarios. CalFuse introduces a dynamic feature calibration mechanism that adaptively balances original CLIP visual representations with task-specific features, preserving the model's intrinsic cross-modal generalization while adapting to new classes. Concurrently, a QR decomposition-based parameter fusion strategy progressively integrates newly acquired knowledge with historical task parameters, maintaining equilibrium between learning new class representations and retaining prior knowledge across sequential tasks. Extensive experiments on benchmark datasets validate the effectiveness of our approach in large-scale multi-modal continual learning settings, demonstrating superior performance over state-of-the-art methods in both average accuracy and final task retention.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes