Continual Cross-Modal Generalization
This work addresses cross-modal generalization for AI systems by enabling knowledge transfer across unannotated modalities. The contribution is incremental, since it builds on existing bimodal-data approaches such as ImageBind.
The paper tackles the challenge of learning a unified representation for multiple modalities without extensive paired data by proposing a continual learning approach that incrementally maps new modalities into a shared discrete codebook. The method achieves strong performance on cross-modal generalization tasks across image-text, audio-text, video-text, and speech-text pairs.
Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical. Inspired by the availability of abundant bimodal data (e.g., in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture of Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text pairs show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.
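To make the idea of a shared discrete codebook that grows over time concrete, here is a minimal sketch in plain Python. It is not the paper's CMoE-Adapter or PMR mechanism: the function names, the Euclidean nearest-neighbor lookup, and the distance-threshold expansion rule are all assumptions chosen for illustration only.

```python
import math


def nearest_code(codebook, vec):
    """Return (index, distance) of the codebook entry closest to vec.

    Uses Euclidean distance; returns (-1, inf) for an empty codebook.
    """
    best_idx, best_dist = -1, math.inf
    for idx, code in enumerate(codebook):
        dist = math.dist(code, vec)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist


def quantize_with_expansion(codebook, vec, threshold):
    """Map vec to a discrete code index in the shared codebook.

    Hypothetical expansion rule (an assumption, not the paper's):
    if no existing code lies within `threshold`, append vec as a new code.
    """
    idx, dist = nearest_code(codebook, vec)
    if idx == -1 or dist > threshold:
        codebook.append(list(vec))
        return len(codebook) - 1
    return idx


# Toy 2-D features standing in for encoder outputs of different modalities.
codebook = [[0.0, 0.0], [1.0, 1.0]]
text_index = quantize_with_expansion(codebook, [0.9, 1.1], threshold=0.5)
audio_index = quantize_with_expansion(codebook, [1.1, 0.9], threshold=0.5)
new_index = quantize_with_expansion(codebook, [5.0, 5.0], threshold=0.5)
```

In this toy run, the text and audio features fall near the same existing code and therefore share one discrete index, which is the property that lets knowledge transfer between modalities. The third feature is far from every code, so the codebook grows by one entry instead of overwriting what was learned earlier.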