CV · RO · Jul 22, 2024

Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection

arXiv:2407.15354v1 · 7 citations · h-index: 28
Originality: Incremental advance
AI Analysis

This work addresses a computational bottleneck in camera-based 3D object detection for autonomous driving, offering an incremental improvement over existing methods.

The paper tackles the computational inefficiency of high-resolution Bird's-Eye-View (BEV) representations in camera-based 3D object detection. It introduces VectorFormer, which combines a high-resolution vector representation with a lower-resolution BEV grid, and reports state-of-the-art NDS and inference time on the nuScenes dataset.

The Bird's-Eye-View (BEV) representation is a critical factor that directly impacts the 3D object detection performance, but the traditional BEV grid representation induces quadratic computational cost as the spatial resolution grows. To address this limitation, we present a new camera-based 3D object detector with high-resolution vector representation: VectorFormer. The presented high-resolution vector representation is combined with the lower-resolution BEV representation to efficiently exploit 3D geometry from multi-camera images at a high resolution through our two novel modules: vector scattering and gathering. To this end, the learned vector representation with richer scene contexts can serve as the decoding query for final predictions. We conduct extensive experiments on the nuScenes dataset and demonstrate state-of-the-art performance in NDS and inference time. Furthermore, we investigate query-BEV-based methods incorporated with our proposed vector representation and observe a consistent performance improvement.
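
The quadratic cost mentioned in the abstract can be illustrated with a minimal sketch. It only counts the cells in a square BEV grid as the cell size shrinks; the function name `bev_cell_count`, the 100 m extent, and the cell sizes are illustrative assumptions, not values taken from the paper, and the sketch does not model VectorFormer itself.

```python
def bev_cell_count(extent_m: float, cell_size_m: float) -> int:
    """Count the cells in a square BEV grid covering extent_m x extent_m.

    Both arguments are hypothetical; the paper's actual grid settings
    are not stated in this summary.
    """
    cells_per_side = round(extent_m / cell_size_m)
    return cells_per_side * cells_per_side


if __name__ == "__main__":
    # Halving the cell size doubles the resolution per side
    # and quadruples the number of cells to process.
    for cell_size in (1.0, 0.5, 0.25):
        print(f"cell size {cell_size} m: {bev_cell_count(100.0, cell_size)} cells")
```

Each doubling of per-side resolution multiplies the cell count by four, which is the growth the abstract cites as the motivation for pairing a lower-resolution BEV grid with a separate high-resolution vector representation.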

Code Implementations: 1 repo
