UNIV: Unified Foundation Model for Infrared and Visible Modalities
This work targets robust perception under diverse conditions, with applications such as autonomous driving and surveillance. The contribution is incremental: it builds on existing foundation-model concepts and adds a new method for cross-modal alignment.
The paper tackles cross-modal degradation in joint RGB-infrared perception. It introduces UNIV, a unified foundation model that uses Patch Cross-modal Contrastive Learning (PCCL) to align infrared and visible modalities, improving infrared semantic segmentation by +1.7 mIoU and infrared detection by +0.7 mAP.
Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.
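The PCCL idea described above (a frozen model proposes pseudo patch pairs, then related infrared-visible patches are attracted and unrelated ones repelled) can be illustrated with a minimal sketch. The abstract does not give the exact loss, sampling rule, or hyperparameters, so everything here is an assumption made for illustration: the cosine-similarity threshold, the multi-positive InfoNCE-style loss, the temperature, and the function names `sample_pseudo_pairs` and `patch_contrastive_loss` are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def sample_pseudo_pairs(teacher_feats, threshold=0.8):
    """Build a boolean matrix of pseudo-positive patch pairs.

    Assumption: patches i and j are treated as semantically related when the
    frozen teacher's features have cosine similarity >= threshold. The
    spatially aligned patch (i == j) is always kept as a positive.
    """
    n = len(teacher_feats)
    return [
        [i == j or cosine(teacher_feats[i], teacher_feats[j]) >= threshold
         for j in range(n)]
        for i in range(n)
    ]


def patch_contrastive_loss(ir_feats, vis_feats, positives, temperature=0.1):
    """Multi-positive InfoNCE-style loss (an assumed form, not the paper's).

    For each infrared patch, the probability mass assigned to its
    pseudo-positive visible patches is maximized relative to all visible
    patches, which attracts related pairs and repels unrelated ones.
    """
    total = 0.0
    for i, ir in enumerate(ir_feats):
        logits = [cosine(ir, vis) / temperature for vis in vis_feats]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(logit - peak) for logit in logits]
        pos_mass = sum(e for e, is_pos in zip(exps, positives[i]) if is_pos)
        total += -math.log(pos_mass / sum(exps))
    return total / len(ir_feats)


# Toy example: four patches, two per semantic class, in a 2-D feature space.
teacher = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
positives = sample_pseudo_pairs(teacher)

aligned_ir = [list(f) for f in teacher]                      # matches visible features
misaligned_ir = [teacher[2], teacher[3], teacher[0], teacher[1]]  # classes swapped

print(patch_contrastive_loss(aligned_ir, teacher, positives))
print(patch_contrastive_loss(misaligned_ir, teacher, positives))
```

In this toy setup the aligned infrared features yield a much lower loss than the class-swapped ones, which is the behavior a loss of this kind is meant to reward. A real implementation would operate on batched tensors from the student and frozen teacher networks rather than Python lists.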