CV · Jun 17, 2025

GrFormer: A Novel Transformer on Grassmann Manifold for Infrared and Visible Image Fusion

arXiv:2506.14384v1 · 8 citations · h-index: 17 · Inf Fusion
Originality: Incremental advance
AI Analysis

This work addresses image fusion for infrared and visible modalities, which is important for applications like surveillance and autonomous systems, but appears incremental as it builds on existing subspace modeling approaches.

The paper tackles the problem of infrared and visible image fusion by addressing limitations of Euclidean methods in capturing non-Euclidean data structures, proposing GrFormer with a Grassmann manifold-based attention mechanism and cross-modal fusion strategy. The result is a network that outperforms state-of-the-art methods on multiple benchmarks, as demonstrated by qualitative and quantitative experiments.
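To make the contrast between Euclidean and subspace similarity concrete, here is a minimal NumPy sketch. It is not the GrFormer attention itself: it assumes feature blocks are represented by orthonormal bases of their column spaces (points on a Grassmann manifold) and compares them with the standard projection-metric similarity. The function names, matrix sizes, and the mixing matrix are hypothetical choices made for illustration.

```python
import numpy as np


def orthonormal_basis(x, rank):
    """Orthonormal basis (n x rank) for the dominant rank-dimensional column subspace of x."""
    u, _, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :rank]


def projection_similarity(x, y, rank):
    """Subspace similarity ||Qx^T Qy||_F^2, which lies in [0, rank].

    It depends only on the spanned subspaces, not on the particular basis.
    """
    qx = orthonormal_basis(x, rank)
    qy = orthonormal_basis(y, rank)
    return float(np.linalg.norm(qx.T @ qy, "fro") ** 2)


rng = np.random.default_rng(0)
rank = 4
x = rng.standard_normal((16, rank))

# Hypothetical invertible mixing matrix: y has different entries than x
# but spans exactly the same column space.
mix = np.array([[2.0, 1.0, 0.0, 0.0],
                [0.0, 1.0, 1.0, 0.0],
                [0.0, 0.0, 3.0, 1.0],
                [0.0, 0.0, 0.0, 1.0]])
y = x @ mix

# An unrelated random feature block for comparison.
z = rng.standard_normal((16, rank))

same_sim = projection_similarity(x, y, rank)   # equals rank up to rounding
other_sim = projection_similarity(x, z, rank)  # strictly smaller than rank

# Entrywise (Euclidean) inner products, shown only for contrast; they change
# with the choice of basis even when the subspace does not.
euclid_xx = float(np.sum(x * x))
euclid_xy = float(np.sum(x * y))

print(same_sim, other_sim, euclid_xx, euclid_xy)
```

The subspace score stays at its maximum for `x` and `y` because they span the same subspace, while the entrywise inner product changes with the basis. This is one way to read the claim that Euclidean inner products measure algebraic rather than structural similarity.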

In image fusion, promising progress has been made by modeling data from different modalities as linear subspaces. In practice, however, the source images often lie in a non-Euclidean space, where Euclidean methods usually cannot capture the intrinsic topological structure. In particular, the inner product in Euclidean space measures algebraic similarity rather than semantic similarity, which leads to undesired attention outputs and degraded fusion performance. Infrared and visible image fusion must also balance low-level details against high-level semantics. To address these issues, we propose GrFormer, a novel attention mechanism based on the Grassmann manifold for infrared and visible image fusion. Specifically, our method constructs a low-rank subspace mapping through projection constraints on the Grassmann manifold, compressing attention features into subspaces of varying rank levels. This forces the features to decouple into high-frequency details (local low-rank) and low-frequency semantics (global low-rank), thereby achieving multi-scale semantic fusion. Additionally, to integrate the significant information effectively, we develop a cross-modal fusion strategy (CMS) based on a covariance mask, which maximises the complementary properties between modalities and suppresses highly correlated features, which are deemed redundant. Experimental results demonstrate that our network outperforms state-of-the-art (SOTA) methods both qualitatively and quantitatively on multiple image fusion benchmarks. The code is available at https://github.com/Shaoyun2023.
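The covariance-mask idea can be illustrated with a small NumPy sketch. This is an assumption-laden toy, not the paper's CMS module: it treats each modality as a (channels, pixels) feature matrix, computes the normalised cross-covariance (Pearson correlation) between infrared and visible channels, and zeroes out channel pairs whose absolute correlation exceeds a threshold. The function names, the threshold of 0.5, and the synthetic features are all hypothetical.

```python
import numpy as np


def cross_correlation(a, b, eps=1e-8):
    """Pearson correlation between every channel of a and every channel of b.

    a, b: arrays of shape (channels, pixels). Returns (channels_a, channels_b).
    """
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    cov = a @ b.T / a.shape[1]                  # cross-covariance matrix
    std_a = np.sqrt((a ** 2).mean(axis=1))
    std_b = np.sqrt((b ** 2).mean(axis=1))
    return cov / (np.outer(std_a, std_b) + eps)


def covariance_mask(ir, vis, threshold=0.5):
    """Binary mask: 1 keeps weakly correlated (complementary) channel pairs,
    0 suppresses strongly correlated (redundant) ones.

    The threshold is an illustrative assumption.
    """
    corr = cross_correlation(ir, vis)
    return (np.abs(corr) < threshold).astype(ir.dtype)


rng = np.random.default_rng(0)
ir = rng.standard_normal((3, 1000))   # toy infrared features
vis = rng.standard_normal((3, 1000))  # toy visible features
vis[0] = ir[0]                        # make channel 0 fully redundant

mask = covariance_mask(ir, vis)
print(mask)
```

In this toy setup the pair (0, 0) is masked out because the two channels are identical, while independent channel pairs are kept. In a real network such a mask would more plausibly gate cross-modal attention scores or feature maps, and a soft weighting could replace the hard threshold.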

