CV · Nov 14, 2022

Recursive Cross-View: Use Only 2D Detectors to Achieve 3D Object Detection without 3D Annotations

arXiv:2211.07108v3 · 1 citation · h-index: 23
Originality: Highly original
AI Analysis

This work addresses the heavy reliance on 3D annotations that limits real-world use of 3D object detection, offering a novel approach that can also serve as a semi-automatic 3D annotator.

The paper tackles the problem of 3D object detection without requiring 3D annotations by proposing Recursive Cross-View (RCV), which converts 3D detection into multiple 2D detection tasks using a recursive paradigm; it outperforms existing image-based methods on SUN RGB-D and KITTI datasets and achieves 7 fps on a live RGB-D stream.

Heavy reliance on 3D annotations limits the real-world application of 3D object detection. In this paper, we propose a method that requires no 3D annotations yet predicts fully oriented 3D bounding boxes. Our method, called Recursive Cross-View (RCV), uses the three-view principle to convert 3D detection into multiple 2D detection tasks, requiring only a subset of 2D labels. We propose a recursive paradigm in which instance segmentation and 3D bounding box generation by Cross-View are applied recursively until convergence. Specifically, our method builds a frustum for each 2D bounding box and then applies the recursive paradigm, which ultimately generates a fully oriented 3D box along with its class and score. Note that the class and score are given by the 2D detector. Evaluated on the SUN RGB-D and KITTI datasets, our method outperforms existing image-based approaches. To demonstrate that our method can be quickly applied to new tasks, we implement it in two real-world scenarios, namely 3D human detection and 3D hand detection. As a result, two new 3D annotated datasets are obtained, which means that RCV can be viewed as a (semi-)automatic 3D annotator. Furthermore, we deploy RCV on a depth sensor, where it achieves detection at 7 fps on a live RGB-D stream. RCV is the first 3D detection method that yields fully oriented 3D boxes without consuming 3D labels.
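To make the frustum and three-view ideas in the abstract concrete, here is a minimal sketch and not the authors' implementation. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and a depth image stored as nested lists. The 2D detector on each view is replaced by a simple tight min/max box, so the result is axis-aligned. The paper's recursive refinement and orientation estimation are not reproduced. All function names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
PixelBox = Tuple[int, int, int, int]  # (u_min, v_min, u_max, v_max), inclusive


def frustum_points(depth: List[List[float]], box: PixelBox,
                   fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Back-project the depth pixels inside a 2D box into camera-frame 3D points."""
    u0, v0, u1, v1 = box
    points: List[Point] = []
    for v in range(v0, v1 + 1):
        for u in range(u0, u1 + 1):
            z = depth[v][u]
            if z <= 0:  # skip pixels with missing depth
                continue
            # Standard pinhole back-projection.
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def view_box(points: List[Point], a: int, b: int) -> Tuple[float, float, float, float]:
    """Stand-in for a 2D detector on one orthographic view (axes a and b).

    Assumption: a real system would run a learned 2D detector here;
    this sketch just takes the tight box around the projected points.
    """
    a_vals = [p[a] for p in points]
    b_vals = [p[b] for p in points]
    return (min(a_vals), min(b_vals), max(a_vals), max(b_vals))


def box3d_from_views(points: List[Point]) -> Tuple[Point, Point]:
    """Combine top, front, and side view boxes into one axis-aligned 3D box."""
    top = view_box(points, 0, 2)    # x-z plane
    front = view_box(points, 0, 1)  # x-y plane
    side = view_box(points, 2, 1)   # z-y plane
    # Each axis appears in two views; intersect the two extents.
    x_min, x_max = max(top[0], front[0]), min(top[2], front[2])
    y_min, y_max = max(front[1], side[1]), min(front[3], side[3])
    z_min, z_max = max(top[1], side[0]), min(top[3], side[2])
    return (x_min, y_min, z_min), (x_max, y_max, z_max)


if __name__ == "__main__":
    # Toy 4x4 depth map at 2 m with one missing pixel.
    depth = [[2.0] * 4 for _ in range(4)]
    depth[1][1] = 0.0
    pts = frustum_points(depth, (1, 1, 2, 2), fx=2.0, fy=2.0, cx=1.0, cy=1.0)
    print(box3d_from_views(pts))
```

With tight min/max boxes the three views always agree, so the intersection step has no effect here. It only matters once each view's box comes from an imperfect detector, which is the situation the recursive paradigm is described as handling.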

