CV · Nov 16, 2025

Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion

arXiv:2511.12432v1 · 10.26 citations

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of balancing generalization and performance in multi-modality image fusion for applications such as scene perception, though it appears incremental: it builds on existing unified models by adding dedicated modules.

The paper tackles the problem of gradient conflicts in unified multi-modality image fusion models by proposing UP-Fusion, which uses semantic-aware channel pruning and text-guided perturbation to enhance feature integration, resulting in performance improvements over existing methods on fusion and downstream tasks.

Multi-modality image fusion enhances scene perception by combining complementary information. Unified models aim to share parameters across modalities for multi-modality image fusion, but large modality differences often cause gradient conflicts, limiting performance. Some methods introduce modality-specific encoders to enhance feature perception and improve fusion quality. However, this strategy reduces generalisation across different fusion tasks. To overcome this limitation, we propose a unified multi-modality image fusion framework based on channel perturbation and pre-trained knowledge integration (UP-Fusion). To suppress redundant modal information and emphasize key features, we propose the Semantic-Aware Channel Pruning Module (SCPM), which leverages the semantic perception capability of a pre-trained model to filter and enhance multi-modality feature channels. Furthermore, we propose the Geometric Affine Modulation Module (GAM), which uses the original modal features to apply affine transformations to the initial fusion features, preserving the modal discriminability of the feature encoder. Finally, we apply a Text-Guided Channel Perturbation Module (TCPM) during decoding to reshape the channel distribution, reducing the dependence on modality-specific channels. Extensive experiments demonstrate that the proposed algorithm outperforms existing methods on both multi-modality image fusion and downstream tasks.
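
The abstract describes GAM as applying affine transformations to the initial fusion features, driven by the original modal features. The sketch below illustrates only that general idea as a channel-wise scale and shift; it is not the paper's implementation. The names `channel_means` and `affine_modulate`, and the fixed rules `gamma = 1 + mean` and `beta = mean`, are assumptions made for illustration. A real module would presumably learn these mappings.

```python
from typing import List

# One flat list of pixel values per channel (hypothetical toy layout).
Channels = List[List[float]]


def channel_means(feat: Channels) -> List[float]:
    """Global average pooling: reduce each channel to a single scalar."""
    return [sum(ch) / len(ch) for ch in feat]


def affine_modulate(fused: Channels, modal: Channels) -> Channels:
    """Scale and shift each fused channel using statistics of the modal feature.

    Assumption: gamma = 1 + mean and beta = mean are fixed stand-ins for
    whatever learned mapping an actual modulation module would use.
    """
    if len(fused) != len(modal):
        raise ValueError("fused and modal must have the same number of channels")
    stats = channel_means(modal)
    # Per channel: out = gamma * value + beta
    return [[(1.0 + m) * v + m for v in ch] for ch, m in zip(fused, stats)]


fused = [[1.0, 2.0], [3.0, 4.0]]
modal = [[0.0, 0.0], [1.0, 1.0]]
print(affine_modulate(fused, modal))
```

In this toy input, a modal channel with zero mean leaves the corresponding fused channel unchanged, while a channel with non-zero mean rescales and shifts it. That is the sense in which the modal features steer the fused representation.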

