CV | Sep 11, 2023

FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection

arXiv:2309.05257v3 | 38 citations | h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses a key bottleneck in autonomous driving by improving 3D object detection accuracy through better multi-modal fusion, though it is incremental as it builds on existing transformer and fusion methods.

The paper tackles the problem of information loss in multi-sensor fusion for 3D object detection by proposing FusionFormer, a transformer-based framework that avoids explicit bird's-eye-view transformations and achieves state-of-the-art performance of 72.6% mAP and 75.1% NDS on the nuScenes dataset.

Multi-sensor modal fusion has demonstrated strong advantages in 3D object detection tasks. However, existing methods that fuse multi-modal features require transforming features into the bird's eye view space and may lose certain information along the Z-axis, thus leading to inferior performance. To this end, we propose a novel end-to-end multi-modal fusion transformer-based framework, dubbed FusionFormer, that incorporates deformable attention and residual structures within the fusion encoding module. Specifically, by developing a uniform sampling strategy, our method can easily sample from 2D image and 3D voxel features simultaneously, which makes it flexibly adaptable and avoids explicit transformation to the bird's eye view space during the feature concatenation process. We further implement a residual structure in our feature encoder to ensure the model's robustness when an input modality is missing. Through extensive experiments on a popular autonomous driving benchmark dataset, nuScenes, our method achieves state-of-the-art single model performance of 72.6% mAP and 75.1% NDS in the 3D object detection task without test time augmentation.
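
The role of the residual structure can be illustrated with a minimal sketch. This is not the paper's implementation: plain scaled dot-product cross-attention stands in for deformable attention, and the names `cross_attention` and `fuse_layer`, the tensor shapes, and the random features are all hypothetical. It only shows why adding each modality's contribution as a residual lets the queries pass through unchanged when that modality is absent.

```python
import numpy as np

def cross_attention(queries, features):
    """Plain scaled dot-product cross-attention.

    Assumption: used here as a simple stand-in for deformable attention.
    queries: (num_queries, channels), features: (num_tokens, channels).
    """
    scores = queries @ features.T / np.sqrt(queries.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over tokens
    return weights @ features

def fuse_layer(queries, image_feats=None, voxel_feats=None):
    """One hypothetical fusion layer with residual connections.

    Each available modality adds its attended features to the queries.
    A missing modality (None) is skipped, so the queries are carried
    forward unchanged by the residual path.
    """
    out = queries
    for feats in (image_feats, voxel_feats):
        if feats is not None:
            out = out + cross_attention(out, feats)
    return out

rng = np.random.default_rng(0)
bev_queries = rng.normal(size=(4, 8))   # 4 queries, 8 channels (illustrative)
image_feats = rng.normal(size=(6, 8))   # flattened 2D image tokens
voxel_feats = rng.normal(size=(5, 8))   # flattened 3D voxel tokens

fused_both = fuse_layer(bev_queries, image_feats, voxel_feats)
fused_lidar_only = fuse_layer(bev_queries, None, voxel_feats)
fused_none = fuse_layer(bev_queries)
```

With both inputs missing, `fuse_layer` returns the queries untouched. With one input missing, the output keeps the same shape and is built from the remaining modality alone, which is the behaviour the abstract attributes to the residual design.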

