TSJNet: A Multi-modality Target and Semantic Awareness Joint-driven Image Fusion Network
This work addresses the challenge of satisfying multiple objectives at once in image fusion for applications such as surveillance and autonomous driving. It is an incremental improvement over existing methods.
The paper tackles multi-modality image fusion by proposing TSJNet, a network that jointly uses target and semantic awareness from detection and segmentation tasks to guide fusion. The reported result is improved object detection and segmentation performance, with average increases of 2.84% in mAP@0.5 and 7.47% in mIoU over state-of-the-art methods.
Multi-modality image fusion integrates complementary information from different modalities into a single image. Current methods primarily enhance fusion with a single high-level task, such as incorporating semantic or object-related information into the fusion process. This single-task design makes it difficult to satisfy multiple objectives simultaneously. We introduce TSJNet, a target and semantic awareness joint-driven fusion network. TSJNet comprises fusion, detection, and segmentation subnetworks arranged in series, and it uses the object and semantic information derived from these two high-level tasks to guide the fusion network. Additionally, we propose a local significant feature extraction module with a dual parallel branch structure that captures the fine-grained features of cross-modal images and fosters interaction among modality, target, and segmentation information. We conducted extensive experiments on four publicly available datasets (MSRS, M3FD, RoadScene, and LLVIP). The results demonstrate that TSJNet generates visually pleasing fused results and achieves average increases of 2.84% in object detection mAP@0.5 and 7.47% in segmentation mIoU compared with state-of-the-art methods.
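For readers unfamiliar with the two reported metrics, the following is a minimal sketch of the quantities underlying them, written under common definitions rather than taken from the paper's evaluation code. The function names `mean_iou` and `box_iou` are hypothetical, and the choice to skip classes absent from both label maps is an assumption that evaluation toolkits handle differently. In mAP@0.5, a detection is typically counted as a true positive when its box IoU with a ground-truth box is at least 0.5; the full average-precision computation is not shown.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed axis-aligned


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over flattened label maps, averaged over classes that occur.

    Assumption: classes absent from both pred and target are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Toy 4-pixel label maps with two classes.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
    # A detection matches at the 0.5 threshold only if box_iou >= 0.5.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)) >= 0.5)
```

Under these definitions, an increase in mIoU means that segmentation maps predicted from the fused images overlap the ground-truth regions more closely on average across classes.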