CV | Jan 3, 2023

Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

arXiv:2301.01283v3 | 150 citations | h-index: 75 | Has Code
Originality: Highly original
AI Analysis

This work addresses robust and efficient 3D detection for autonomous driving, representing a strong specific gain rather than a foundational breakthrough.

The paper tackles 3D object detection by proposing Cross Modal Transformer (CMT), an end-to-end multi-modal detector that directly outputs 3D bounding boxes from image and point cloud tokens without explicit view transformation, achieving 74.1% NDS on the nuScenes test set with fast inference speed and robustness to missing LiDAR data.

In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. It achieves 74.1% NDS (state-of-the-art with a single model) on the nuScenes test set while maintaining fast inference speed. Moreover, CMT remains robust even when LiDAR input is missing. Code is released at https://github.com/junjie18/CMT.
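The abstract's alignment idea (encoding 3D points into the features of both modalities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pinhole camera model, the identity camera-to-world extrinsics, the sampled depths and heights, and the fixed weight matrix standing in for a learned MLP are all assumptions made for illustration.

```python
# Illustrative sketch only. All names, shapes, and constants are hypothetical;
# see the released CMT code for the actual method.

def image_token_points(u, v, fx, fy, cx, cy, depths):
    """Back-project pixel (u, v) to one 3D point per sampled depth.

    Assumes a pinhole camera whose frame coincides with the shared 3D frame.
    """
    return [((u - cx) / fx * d, (v - cy) / fy * d, d) for d in depths]


def bev_token_points(x, y, heights):
    """Lift a bird's-eye-view cell centre (x, y) to one 3D point per height."""
    return [(x, y, z) for z in heights]


def flatten(points):
    """Concatenate the coordinates of all points into one flat vector."""
    return [c for p in points for c in p]


def linear(vec, weights):
    """Matrix-vector product standing in for a learned embedding network."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def add_position(token, points, weights):
    """Add a position embedding derived from 3D points to a token feature."""
    pe = linear(flatten(points), weights)
    return [t + p for t, p in zip(token, pe)]


# Two depths and two heights give 6 coordinates per token in both modalities.
depths = [1.0, 2.0]
heights = [0.0, 1.0]
weights = [[0.1 * (i + 1)] * 6 for i in range(4)]  # 4-dim embedding, fixed

img_tok = add_position(
    [0.0] * 4,
    image_token_points(320, 240, 500.0, 500.0, 320.0, 240.0, depths),
    weights,
)
pts_tok = add_position([0.0] * 4, bev_token_points(1.0, 2.0, heights), weights)

# Both token types now carry embeddings derived from 3D coordinates, so a
# transformer decoder could attend over them as a single sequence.
tokens = [img_tok, pts_tok]
```

Because both token types receive embeddings computed from points in one 3D frame, no explicit image-to-BEV view transformation is needed in this sketch, which mirrors the claim in the abstract at a toy scale.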

Code Implementations: 2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
