Object-Aware Centroid Voting for Monocular 3D Object Detection
This paper addresses the dependence on expensive LiDAR and the high computational cost of 3D object detection for autonomous driving applications. It represents a strong incremental improvement.
The paper tackles monocular 3D object detection by proposing an end-to-end method that avoids dense depth estimation, using object-aware voting on 3D centroid proposals instead. It achieves state-of-the-art results on the KITTI benchmark, significantly outperforming other monocular-based methods.
Monocular 3D object detection aims to detect objects in the 3D physical world from a single camera. However, recent approaches either rely on expensive LiDAR devices or resort to dense pixel-wise depth estimation, which incurs prohibitive computational cost. In this paper, we propose an end-to-end trainable monocular 3D object detector that does not learn dense depth. Specifically, the grid coordinates of a 2D box are first projected back to 3D space with the pinhole model to form 3D centroid proposals. Then, a novel object-aware voting approach is introduced, which considers both the region-wise appearance attention and the geometric projection distribution, to vote on the 3D centroid proposals for 3D object localization. With late fusion and the predicted 3D orientation and dimensions, the 3D bounding boxes of objects can be detected from a single RGB image. The method is straightforward yet significantly superior to other monocular-based methods. Extensive experimental results on the challenging KITTI benchmark validate the effectiveness of the proposed method.
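The two geometric steps named in the abstract, back-projecting 2D grid coordinates with the pinhole model and voting over the resulting centroid proposals, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`box_grid`, `back_project`, `vote`), the grid layout, and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical. The abstract does not say how depth is obtained for each grid point, so `depth` is treated here as an assumed input. The voting weights stand in for the learned appearance attention and geometric projection distribution, which this sketch does not reproduce.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def box_grid(box: Tuple[float, float, float, float], n: int) -> List[Point2D]:
    """Return the cell centers of an n x n grid covering a 2D box (u1, v1, u2, v2)."""
    u1, v1, u2, v2 = box
    du, dv = (u2 - u1) / n, (v2 - v1) / n
    return [(u1 + (i + 0.5) * du, v1 + (j + 0.5) * dv)
            for j in range(n) for i in range(n)]


def back_project(uv: Point2D, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Invert the pinhole projection u = fx * X / Z + cx, v = fy * Y / Z + cy.

    `depth` (Z) is an assumed input; how it is estimated is outside this sketch.
    """
    u, v = uv
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def vote(proposals: List[Point3D], weights: List[float]) -> Point3D:
    """Combine centroid proposals by a normalized weighted average.

    The weights are placeholders for learned, object-aware scores.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    x = sum(w * p[0] for w, p in zip(weights, proposals)) / total
    y = sum(w * p[1] for w, p in zip(weights, proposals)) / total
    z = sum(w * p[2] for w, p in zip(weights, proposals)) / total
    return (x, y, z)
```

A caller would generate grid points with `box_grid`, lift each one with `back_project`, and pass the resulting proposals with one weight per proposal to `vote`. A weighted average is only one simple way to aggregate votes; it is used here because it makes the role of the per-proposal weights explicit.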