CV | Jan 9

DIFF-MF: A Difference-Driven Channel-Spatial State Space Model for Multi-Modal Image Fusion

arXiv:2601.05538v1 | h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses the challenge of integrating complementary information from multiple source images for applications such as autonomous driving and UAV surveillance. It is an incremental improvement over prior state space models.

The paper tackles multi-modal image fusion, where existing state space models often sacrifice visible details for infrared intensity or vice versa. It proposes DIFF-MF, which the authors report outperforms existing methods in visual quality and quantitative evaluation on driving and UAV datasets.

Multi-modal image fusion aims to integrate complementary information from multiple source images to produce high-quality fused images with enriched content. Although existing approaches based on state space models have achieved satisfactory performance with high computational efficiency, they tend either to over-prioritize infrared intensity at the cost of visible details or, conversely, to preserve visible structure while diminishing thermal target salience. To overcome these challenges, we propose DIFF-MF, a novel difference-driven channel-spatial state space model for multi-modal image fusion. Our approach leverages feature discrepancy maps between modalities to guide feature extraction, followed by a fusion process across both channel and spatial dimensions. In the channel dimension, a channel-exchange module enhances channel-wise interaction through cross-attention dual state space modeling, enabling adaptive feature reweighting. In the spatial dimension, a spatial-exchange module employs cross-modal state space scanning to achieve comprehensive spatial fusion. By efficiently capturing global dependencies while maintaining linear computational complexity, DIFF-MF effectively integrates complementary multi-modal features. Experimental results on driving-scenario and low-altitude UAV datasets demonstrate that our method outperforms existing approaches in both visual quality and quantitative evaluation.
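To make two ideas from the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a sigmoid gate on the absolute feature difference as the "discrepancy map" guidance, and it stands in for cross-modal state space scanning with a fixed-decay linear recurrence over interleaved infrared and visible tokens. The function names, the gate formula, the `decay` parameter, and the interleaving order are all hypothetical choices made for illustration; the paper's modules use learned parameters and are not specified in this summary.

```python
import numpy as np


def difference_guided_features(feat_ir, feat_vis):
    """Reweight two (C, H, W) feature maps using their discrepancy map.

    Assumption: the gate is sigmoid(|ir - vis|); the paper's gate may differ.
    """
    diff = feat_ir - feat_vis                    # feature discrepancy map
    gate = 1.0 / (1.0 + np.exp(-np.abs(diff)))   # values in [0.5, 1)
    # Move each branch toward the other where they disagree, scaled by the gate.
    ir_out = feat_ir - gate * diff
    vis_out = feat_vis + gate * diff
    return ir_out, vis_out


def linear_scan(tokens, decay=0.9):
    """Fixed-decay recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t.

    tokens has shape (L, C). Cost is O(L), which is the linear-complexity
    property the abstract refers to (real SSMs use learned dynamics).
    """
    state = np.zeros(tokens.shape[1])
    out = np.empty_like(tokens, dtype=float)
    for t, x in enumerate(tokens):
        state = decay * state + (1.0 - decay) * x
        out[t] = state
    return out


def cross_modal_scan(feat_ir, feat_vis, decay=0.9):
    """Hypothetical cross-modal scan: interleave IR and visible tokens,
    run one linear scan, and read the state after each (IR, visible) pair."""
    c, h, w = feat_ir.shape
    ir_tok = feat_ir.reshape(c, h * w).T         # (H*W, C)
    vis_tok = feat_vis.reshape(c, h * w).T       # (H*W, C)
    seq = np.empty((2 * h * w, c))
    seq[0::2] = ir_tok                           # even positions: infrared
    seq[1::2] = vis_tok                          # odd positions: visible
    fused_tok = linear_scan(seq, decay)[1::2]    # state after both modalities
    return fused_tok.T.reshape(c, h, w)
```

A call such as `cross_modal_scan(*difference_guided_features(ir, vis))` chains the two steps on a pair of same-shaped feature maps. The single pass over `2 * H * W` tokens is what keeps the cost linear in the number of pixels, in contrast to the quadratic cost of full self-attention.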

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
