CVNov 16, 2024Code
SMLNet: A SPD Manifold Learning Network for Infrared and Visible Image FusionHuan Kang, Hui Li, Tianyang Xu et al.
Euclidean representation learning methods have achieved promising results in image fusion tasks, which can be attributed to their clear advantages in handling with linear space. However, data collected from a realistic scene usually has a non-Euclidean structure, evaluating the consistency of latent representations from paired views using Euclidean distance raises challenges. To address this issue, a novel SPD (symmetric positive definite) manifold learning is proposed for multi-modal image fusion, named SMLNet, which extends the image fusion approach from the Euclidean space to the SPD manifolds. Specifically, we encode images according to the Riemannian geometry to exploit their intrinsic statistical correlations, thereby aligning with human visual perception. The SPD matrix fundamentally underpins our network's learning process. Building upon this mathematical foundation, we employ a cross-modal fusion strategy to exploit modality-specific dependencies and augment complementary information. To capture semantic similarity in images' intrinsic space, we further develop an attention module that meticulously processes the cross-modal semantic affinity matrix. Based on this, we design an end-to-end fusion network based on cross-modal manifold learning. Extensive experiments on public datasets demonstrate that our framework exhibits superior performance compared to the current state-of-the-art methods. Our code will be publicly available at https://github.com/Shaoyun2023.
CVJun 17, 2025
GrFormer: A Novel Transformer on Grassmann Manifold for Infrared and Visible Image FusionHuan Kang, Hui Li, Xiao-Jun Wu et al.
In the field of image fusion, promising progress has been made by modeling data from different modalities as linear subspaces. However, in practice, the source images are often located in a non-Euclidean space, where the Euclidean methods usually cannot encapsulate the intrinsic topological structure. Typically, the inner product performed in the Euclidean space calculates the algebraic similarity rather than the semantic similarity, which results in undesired attention output and a decrease in fusion performance. While the balance of low-level details and high-level semantics should be considered in infrared and visible image fusion task. To address this issue, in this paper, we propose a novel attention mechanism based on Grassmann manifold for infrared and visible image fusion (GrFormer). Specifically, our method constructs a low-rank subspace mapping through projection constraints on the Grassmann manifold, compressing attention features into subspaces of varying rank levels. This forces the features to decouple into high-frequency details (local low-rank) and low-frequency semantics (global low-rank), thereby achieving multi-scale semantic fusion. Additionally, to effectively integrate the significant information, we develop a cross-modal fusion strategy (CMS) based on a covariance mask to maximise the complementary properties between different modalities and to suppress the features with high correlation, which are deemed redundant. The experimental results demonstrate that our network outperforms SOTA methods both qualitatively and quantitatively on multiple image fusion benchmarks. The codes are available at https://github.com/Shaoyun2023.