IV · CV · Oct 30, 2021

Cross-Modality Fusion Transformer for Multispectral Object Detection

arXiv:2111.00273v4 · 332 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses robust object detection in open-world scenarios by fusing multispectral data, representing an incremental improvement over prior CNN-based methods.

The paper tackles multispectral object detection by proposing a Cross-Modality Fusion Transformer (CFT) that integrates RGB and thermal data using self-attention, achieving state-of-the-art performance on multiple datasets.

Multispectral image pairs can provide combined information, making object detection applications more reliable and robust in the open world. To fully exploit the different modalities, we present in this paper a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT). Unlike prior CNN-based works, our network is guided by the transformer scheme: it learns long-range dependencies and integrates global contextual information in the feature extraction stage. More importantly, by leveraging the self-attention of the transformer, the network can naturally carry out simultaneous intra-modality and inter-modality fusion and robustly capture the latent interactions between the RGB and thermal domains, thereby significantly improving the performance of multispectral object detection. Extensive experiments and ablation studies on multiple datasets demonstrate that our approach is effective and achieves state-of-the-art detection performance. Our code and models are available at https://github.com/DocF/multispectral-object-detection.
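
The claim that self-attention performs intra-modality and inter-modality fusion at the same time can be illustrated with a small sketch. It is not the authors' implementation. It assumes the RGB and thermal feature maps have already been flattened into token sequences and are concatenated along the sequence axis, and it uses a single attention head with no positional embedding, layer normalization, or MLP. The function name `cross_modality_attention` and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_modality_attention(rgb_tokens, thermal_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated RGB and thermal tokens.

    rgb_tokens, thermal_tokens: arrays of shape (n, d)
    w_q, w_k, w_v: projection matrices of shape (d, d)
    Returns (fused_rgb, fused_thermal, attention), with attention of shape (2n, 2n).
    """
    n = rgb_tokens.shape[0]
    # Assumption: both modalities share one sequence, RGB tokens first.
    tokens = np.concatenate([rgb_tokens, thermal_tokens], axis=0)  # (2n, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product scores
    attention = softmax(scores, axis=-1)     # each row sums to 1
    fused = tokens + attention @ v           # residual connection
    # Split the fused sequence back into its two modality streams.
    return fused[:n], fused[n:], attention


rng = np.random.default_rng(0)
n, d = 4, 8  # hypothetical: 4 tokens per modality, 8 channels
rgb = rng.standard_normal((n, d))
thermal = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused_rgb, fused_thermal, attention = cross_modality_attention(
    rgb, thermal, w_q, w_k, w_v
)

# Blocks of the attention matrix seen by the RGB query tokens:
intra_rgb = attention[:n, :n]               # RGB attends to RGB
inter_rgb_from_thermal = attention[:n, n:]  # RGB attends to thermal
```

Because the softmax runs over the whole concatenated sequence, every RGB token distributes its attention across both RGB and thermal tokens in one step. The diagonal blocks of `attention` carry the intra-modality weights and the off-diagonal blocks carry the inter-modality weights, which is the sense in which a single self-attention layer covers both kinds of fusion.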

Code Implementations: 1 repo
