Multi-Modal Continual Learning via Cross-Modality Adapters and Representation Alignment with Knowledge Preservation
This work addresses multi-modal continual learning, an incremental but important setting for applications such as robotics and autonomous systems, which must integrate diverse sensory inputs over time.
The paper tackles catastrophic forgetting in multi-modal continual learning by proposing a pre-trained model-based framework with mixture-of-experts cross-modality adapters and a representation alignment loss, achieving higher accuracy and less forgetting than baselines on several multi-modal datasets in both class-incremental and domain-incremental settings.
Continual learning is essential for adapting models to new tasks while retaining previously acquired knowledge. While existing approaches predominantly focus on uni-modal data, multi-modal learning offers substantial benefits by utilizing diverse sensory inputs, akin to human perception. However, multi-modal continual learning presents additional challenges, as the model must effectively integrate new information from various modalities while preventing catastrophic forgetting. In this work, we propose a pre-trained model-based framework for multi-modal continual learning. Our framework includes a novel cross-modality adapter with a mixture-of-experts structure to facilitate effective integration of multi-modal information across tasks. We also introduce a representation alignment loss that fosters learning of robust multi-modal representations, and regularize relationships between learned representations to preserve knowledge from previous tasks. Experiments on several multi-modal datasets demonstrate that our approach consistently outperforms baselines in both class-incremental and domain-incremental learning, achieving higher accuracy and reduced forgetting.
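To make the three components named in the abstract concrete, here is a minimal sketch in plain Python. The abstract does not give the exact formulations, so everything here is an assumption made for illustration: the function names (`moe_cross_adapter`, `alignment_loss`, `relation_preservation_loss`), the softmax router over concatenated features, the linear experts, the cosine-based alignment term, and the squared difference between pairwise similarity matrices. It should not be read as the paper's actual method.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    # Multiply a row-major matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(u, v):
    # Cosine similarity with a small epsilon to avoid division by zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def moe_cross_adapter(query_feat, other_feat, experts, router):
    # Hypothetical adapter: a router scores the experts from the
    # concatenated features of both modalities, each linear expert maps
    # the concatenation back to the query dimension, and the gated
    # mixture is added to the query features as a residual update.
    joint = query_feat + other_feat  # list concatenation
    gates = softmax(matvec(router, joint))
    outputs = [matvec(weights, joint) for weights in experts]
    update = [
        sum(g * out[i] for g, out in zip(gates, outputs))
        for i in range(len(query_feat))
    ]
    return [q + u for q, u in zip(query_feat, update)], gates


def alignment_loss(feat_a, feat_b):
    # Assumed form: pull paired features from two modalities together.
    return 1.0 - cosine(feat_a, feat_b)


def relation_matrix(reps):
    # Pairwise cosine similarities between all representations.
    return [[cosine(u, v) for v in reps] for u in reps]


def relation_preservation_loss(old_reps, new_reps):
    # Assumed form: penalize changes in the pairwise relations between
    # representations from the previous model and the current model.
    old = relation_matrix(old_reps)
    new = relation_matrix(new_reps)
    n = len(old)
    return sum(
        (old[i][j] - new[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)


# Toy usage with two 2-dimensional modality features and two experts.
audio = [1.0, 0.0]
video = [0.0, 1.0]
copy_video = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # passes video through
silent = [[0.0] * 4, [0.0] * 4]                            # contributes nothing
uniform_router = [[0.0] * 4, [0.0] * 4]                    # equal gate weights
adapted, gates = moe_cross_adapter(audio, video, [copy_video, silent], uniform_router)
```

In this toy setup the router weights are all zero, so both experts receive equal gate weight, and the adapted audio feature is the original feature plus half of the video feature. Because the relation term compares similarity structure instead of raw features, it leaves the representations free to move as long as their mutual relationships are kept, which is one plausible reading of "regularize relationships between learned representations".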