CVMay 5, 2025

RGBX-DiffusionDet: A Framework for Multi-Modal RGB-X Object Detection Using DiffusionDet

arXiv:2505.02586v3h-index: 5Pattern Recognition
Originality Incremental advance
AI Analysis

This work addresses multimodal object detection for applications like autonomous driving and surveillance, but it is incremental as it builds on an existing diffusion-based method.

The paper tackles object detection by extending DiffusionDet to fuse RGB with other 2D modalities (e.g., depth, polarimetric, infrared) using an adaptive encoder and novel modules, achieving consistent superiority over the RGB-only baseline in experiments on datasets like KITTI and M$^3$FD.

This work introduces RGBX-DiffusionDet, an object detection framework extending the DiffusionDet model to fuse the heterogeneous 2D data (X) with RGB imagery via an adaptive multimodal encoder. To enable cross-modal interaction, we design the dynamic channel reduction within a convolutional block attention module (DCR-CBAM), which facilitates cross-talk between subnetworks by dynamically highlighting salient channel features. Furthermore, the dynamic multi-level aggregation block (DMLAB) is proposed to refine spatial feature representations through adaptive multiscale fusion. Finally, novel regularization losses that enforce channel saliency and spatial selectivity are introduced, leading to compact and discriminative feature embeddings. Extensive experiments using RGB-Depth (KITTI), a novel annotated RGB-Polarimetric dataset, and RGB-Infrared (M$^3$FD) benchmark dataset were conducted. We demonstrate consistent superiority of the proposed approach over the baseline RGB-only DiffusionDet. The modular architecture maintains the original decoding complexity, ensuring efficiency. These results establish the proposed RGBX-DiffusionDet as a flexible multimodal object detection approach, providing new insights into integrating diverse 2D sensing modalities into diffusion-based detection pipelines.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes