CV | Jan 20, 2025

MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching

Yepeng Liu, Zhichao Sun, Baosheng Yu, Yitian Zhao, Bo Du, Yongchao Xu, Jun Cheng

arXiv:2501.11299v3 | 11.812 citations | h-index: 28 | Has Code | IEEE Transactions on Image Processing

Originality: Incremental advance

AI Analysis

This work addresses the challenge of costly aligned multimodal data for image matching in medical and remote sensing applications, though it is incremental because it builds on existing keypoint methods.

The paper tackles the problem of multimodal image matching by proposing MIFNet, which learns modality-invariant features using only single-modality training data, achieving good zero-shot generalization across retinal and remote sensing datasets.

Many keypoint detection and description methods have been proposed for image matching or registration. While these methods demonstrate promising performance for single-modality image matching, they often struggle with multimodal data because descriptors trained on single-modality data tend to lack robustness against the non-linear variations present in multimodal data. Extending such methods to multimodal image matching often requires well-aligned multimodal data to learn modality-invariant descriptors. However, acquiring such data is often costly and impractical in many real-world scenarios. To address this challenge, we propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching using only single-modality training data. Specifically, we propose a novel latent feature aggregation module and a cumulative hybrid aggregation module to enhance the base keypoint descriptors trained on single-modality data by leveraging pre-trained features from Stable Diffusion models. We validate our method with recent keypoint detection and description methods on three multimodal retinal image datasets (CF-FA, CF-OCT, EMA-OCTA) and two remote sensing datasets (Optical-SAR and Optical-NIR). Extensive experiments demonstrate that the proposed MIFNet is able to learn modality-invariant features for multimodal image matching without accessing the target modality and has good zero-shot generalization ability. The code will be released at https://github.com/lyp-deeplearning/MIFNet.
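For readers unfamiliar with how keypoint descriptors are used once they have been computed, the following is a minimal, generic sketch of mutual-nearest-neighbor matching under cosine similarity. It is not taken from the MIFNet code and does not reproduce the paper's aggregation modules; the function names and the toy descriptors are assumptions made purely for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(desc_a, desc_b):
    """Return sim[i][j] = cosine similarity between desc_a[i] and desc_b[j]."""
    a = [l2_normalize(d) for d in desc_a]
    b = [l2_normalize(d) for d in desc_b]
    return [[sum(x * y for x, y in zip(da, db)) for db in b] for da in a]


def mutual_nearest_neighbors(desc_a, desc_b):
    """Keep a pair (i, j) only if j is the best match for i AND i is the best for j.

    Hypothetical helper for illustration; real pipelines usually add a
    ratio test or a geometric verification step (e.g. RANSAC) afterwards.
    """
    sim = cosine_similarity_matrix(desc_a, desc_b)
    if not sim or not sim[0]:
        return []
    # Best column (descriptor in B) for every row (descriptor in A).
    best_b_for_a = [max(range(len(row)), key=row.__getitem__) for row in sim]
    # Best row (descriptor in A) for every column (descriptor in B).
    best_a_for_b = [
        max(range(len(sim)), key=lambda i: sim[i][j]) for j in range(len(sim[0]))
    ]
    return [(i, j) for i, j in enumerate(best_b_for_a) if best_a_for_b[j] == i]


# Toy descriptors standing in for keypoints from two images of different modalities.
descriptors_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
descriptors_b = [[0.0, 2.0], [3.0, 0.1]]
matches = mutual_nearest_neighbors(descriptors_a, descriptors_b)
```

The mutual check discards one-sided matches, which matters most when descriptors are less reliable, as is typical across modalities. More modality-invariant descriptors would be expected to leave more correct pairs after this filter.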

