CV · Mar 17, 2023

CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

arXiv:2303.10209v1 · 52 citations · h-index: 60 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses multi-view 3D object detection for autonomous driving systems and offers an incremental improvement over existing query-based methods.

The paper tackles 3D object detection from multi-view images by proposing CAPE, a method that builds position embeddings in each camera's local coordinate system so that the view transformation is easier to learn. It reports state-of-the-art performance among LiDAR-free methods on the nuScenes dataset, with 61.0% NDS and 52.5% mAP.

In this paper, we address the problem of detecting 3D objects from multi-view images. Current query-based methods rely on global 3D position embeddings (PE) to learn the geometric correspondence between images and 3D space. We claim that directly interacting 2D image features with global 3D PE could increase the difficulty of learning view transformation due to the variation of camera extrinsics. Thus we propose a novel method based on CAmera view Position Embedding, called CAPE. We form the 3D position embeddings under the local camera-view coordinate system instead of the global coordinate system, such that the 3D position embedding is free of encoding camera extrinsic parameters. Furthermore, we extend CAPE to temporal modeling by exploiting the object queries of previous frames and encoding the ego-motion to boost 3D object detection. CAPE achieves state-of-the-art performance (61.0% NDS and 52.5% mAP) among all LiDAR-free methods on the nuScenes dataset. Code and models are available in Paddle3D (https://github.com/PaddlePaddle/Paddle3D) and as a PyTorch implementation (https://github.com/kaixinbear/CAPE).
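
The geometric idea in the abstract can be illustrated with a small NumPy sketch. It is not taken from the CAPE code base: the function names, the intrinsic matrix, the two camera poses, and the yaw-only extrinsics are all hypothetical, and the learned embedding network is omitted. The sketch only shows that frustum points expressed in the camera frame are identical for cameras sharing intrinsics, whereas the same points mapped into the global frame depend on each camera's extrinsics.

```python
import numpy as np

def frustum_points_cam(K, us, vs, depths):
    """Back-project pixels (us x vs) at several depths into the camera frame.

    Returns an array of shape (D, H, W, 3). Only intrinsics K are used.
    """
    uu, vv = np.meshgrid(us, vs)                          # each (H, W)
    pix = np.stack([uu, vv, np.ones_like(uu)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                       # K^{-1} [u, v, 1]^T
    return depths[:, None, None, None] * rays[None]

def transform(T, pts):
    """Apply a 4x4 rigid transform T to points of shape (..., 3)."""
    return pts @ T[:3, :3].T + T[:3, 3]

def make_extrinsic(yaw, t):
    """Hypothetical camera-to-global pose: rotation about z plus translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T

# Assumed values, chosen only for illustration.
K = np.array([[1000.0, 0.0, 800.0], [0.0, 1000.0, 450.0], [0.0, 0.0, 1.0]])
us = np.array([0.0, 800.0, 1600.0])
vs = np.array([0.0, 450.0, 900.0])
depths = np.array([1.0, 10.0, 50.0])

cam2global_front = make_extrinsic(0.0, [1.5, 0.0, 1.6])
cam2global_back = make_extrinsic(np.pi, [-1.0, 0.0, 1.6])

# Camera-view input: the same points serve every camera with these intrinsics.
pts_cam = frustum_points_cam(K, us, vs, depths)

# Global-frame input: the points now differ per camera because of extrinsics.
pts_global_front = transform(cam2global_front, pts_cam)
pts_global_back = transform(cam2global_back, pts_cam)

# A global reference point can instead be moved into each camera's frame.
ref_global = np.array([10.0, 2.0, 0.5])
ref_in_front = transform(np.linalg.inv(cam2global_front), ref_global)
ref_in_back = transform(np.linalg.inv(cam2global_back), ref_global)
```

In this sketch, `pts_cam` is what a camera-view embedding would consume, while `pts_global_front` and `pts_global_back` are what a global embedding would consume. Moving `ref_global` into each camera's frame is one way to keep both sides of the comparison in the same local coordinate system.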

Code Implementations: 2 repos
