CV · Mar 24, 2025

OCCO: LVM-guided Infrared and Visible Image Fusion Framework based on Object-aware and Contextual COntrastive Learning

Hui Li, Congcong Bian, Zeyang Zhang, Xiaoning Song, Xi Li, Xiao-Jun Wu

arXiv:2503.18635v1 · 2 citations · h-index: 8 · Int J Comput Vis

Originality: Incremental advance

AI Analysis

This work addresses the trade-off between fused-image quality and downstream task performance in image fusion, offering an incremental improvement by combining pre-trained LVMs with contrastive learning.

The paper tackles the difficulty of balancing high-quality fused images against downstream task performance in image fusion. It proposes OCCO, an LVM-guided framework with object-aware and contextual contrastive learning, validates it against eight state-of-the-art methods on four datasets, and reports exceptional performance on a downstream visual task.

Image fusion is a crucial technique in the field of computer vision, and its goal is to generate high-quality fused images and improve the performance of downstream tasks. However, existing fusion methods struggle to balance these two factors. Achieving high quality in fused images may result in lower performance in downstream visual tasks, and vice versa. To address this drawback, a novel LVM (large vision model)-guided fusion framework with Object-aware and Contextual COntrastive learning is proposed, termed OCCO. The pre-trained LVM is utilized to provide semantic guidance, allowing the network to focus solely on the fusion task while learning salient semantic features through contrastive learning. Additionally, a novel feature interaction fusion network is designed to resolve information conflicts in fused images caused by modality differences. By learning the distinction between positive samples and negative samples in the latent feature space (contextual space), the integrity of target information in the fused image is improved, thereby benefiting downstream performance. Finally, comparisons with eight state-of-the-art methods on four datasets validate the effectiveness of the proposed method, which also demonstrates exceptional performance on a downstream visual task.
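The abstract does not spell out the form of the contrastive objective, so the following is only a minimal sketch of the general idea of separating positive and negative samples in a latent feature space. It uses a generic InfoNCE-style loss, which is an assumption here and not the loss defined in the paper. The function name `info_nce_loss`, the `temperature` parameter, and the toy feature vectors are all hypothetical.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style contrastive loss for a single anchor vector.

    NOTE: illustrative assumption, not the OCCO loss from the paper.
    anchor:    (d,)   latent feature of a fused-image region (hypothetical)
    positive:  (d,)   feature the anchor should be pulled toward
    negatives: (k, d) features the anchor should be pushed away from
    """
    def normalize(x):
        # Unit-normalize along the feature axis so dot products are cosines.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a = normalize(np.asarray(anchor, dtype=float))
    p = normalize(np.asarray(positive, dtype=float))
    n = normalize(np.asarray(negatives, dtype=float))

    pos_logit = np.dot(a, p) / temperature      # scalar similarity to the positive
    neg_logits = (n @ a) / temperature          # (k,) similarities to the negatives
    logits = np.concatenate(([pos_logit], neg_logits))

    # Numerically stable negative log-softmax; the positive sits at index 0.
    logits = logits - logits.max()
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

# Toy usage: the loss is small when the anchor is close to its positive.
anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.1])
negatives = np.array([[0.0, 1.0], [-1.0, 0.0]])
loss = info_nce_loss(anchor, positive, negatives)
```

Minimizing a loss of this kind pulls the anchor feature toward its positive and away from the negatives. How positives and negatives are actually constructed from the LVM guidance is described in the paper itself, not in this sketch.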
