CV · Nov 19, 2022

Sparse4D: Multi-view 3D Object Detection with Sparse Spatial-Temporal Fusion

arXiv:2211.10581v2 · 189 citations · h-index: 24
Originality: Highly original
AI Analysis

This work addresses the performance gap in sparse-based 3D detection for autonomous driving applications, offering a more efficient alternative to dense methods that is better suited for edge device deployment.

The paper tackles multi-view 3D object detection by introducing Sparse4D, a method that refines anchor boxes through sparse spatial-temporal feature sampling and fusion, achieving state-of-the-art performance among sparse-based methods and competitive results with BEV-based methods on the nuScenes dataset.

Bird's-eye-view (BEV) based methods have recently made great progress in the multi-view 3D detection task. Compared with BEV-based methods, sparse-based methods lag behind in performance but still have many non-negligible merits. To push sparse 3D detection further, we introduce a novel method, named Sparse4D, which iteratively refines anchor boxes by sparsely sampling and fusing spatial-temporal features. (1) Sparse 4D Sampling: for each 3D anchor, we assign multiple 4D keypoints, which are then projected onto multi-view/scale/timestamp image features to sample the corresponding features; (2) Hierarchy Feature Fusion: we hierarchically fuse the sampled features across views/scales, timestamps, and keypoints to generate a high-quality instance feature. In this way, Sparse4D achieves 3D detection efficiently and effectively without relying on dense view transformation or global attention, and is friendlier to edge-device deployment. Furthermore, we introduce an instance-level depth reweight module to alleviate the ill-posed issue in 3D-to-2D projection. In experiments, our method outperforms all sparse-based methods and most BEV-based methods on the detection task of the nuScenes dataset.

Code Implementations: 1 repo