Delving into the Pre-training Paradigm of Monocular 3D Object Detection
This work addresses label scarcity in monocular 3D object detection for researchers and practitioners in autonomous driving by introducing a pre-training approach. The contribution is incremental in that it builds on existing depth estimation and 2D detection methods.
The paper tackles the problem of expensive labels in monocular 3D object detection by proposing a pre-training paradigm that uses unlabeled data. With a DLA34 backbone in a naive center-based detector, it reports an 18.71% boost in the moderate ${\rm AP}_{3D}70$ score of Car on the KITTI-3D testing set and a 40.41% relative improvement in NDS on the nuScenes validation set.
Labels for monocular 3D object detection (M3OD) are expensive to obtain. Meanwhile, a large amount of unlabeled data is usually available in practical applications, and pre-training is an efficient way of exploiting the knowledge in unlabeled data. However, the pre-training paradigm for M3OD has hardly been studied. We aim to bridge this gap in this work. To this end, we first draw two observations: (1) The guideline for devising pre-training tasks is to imitate the representation of the target task. (2) Combining depth estimation and 2D object detection is a promising M3OD pre-training baseline. Afterwards, following the guideline, we propose several strategies to further improve this baseline, which mainly include target-guided semi-dense depth estimation, keypoint-aware 2D object detection, and class-level loss adjustment. Combining all the developed techniques, the obtained pre-training framework produces pre-trained backbones that improve M3OD performance significantly on both the KITTI-3D and nuScenes benchmarks. For example, by applying a DLA34 backbone to a naive center-based M3OD detector, the moderate ${\rm AP}_{3D}70$ score of Car on the KITTI-3D testing set is boosted by 18.71\% and the NDS score on the nuScenes validation set is improved by a relative 40.41\%.
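The abstract names the pre-training ingredients but does not define them, so the following is only a minimal sketch of how a joint depth-plus-2D-detection objective could be assembled, not the paper's formulation. It assumes that "target-guided semi-dense" means supervising depth only on pixels covered by object boxes, and that "class-level loss adjustment" means scaling each object's loss by a per-class weight. All function names, the box format, and the weights are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels, end-exclusive.
Box = Tuple[int, int, int, int]


def semi_dense_depth_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    boxes: Sequence[Box],
) -> float:
    """Mean absolute depth error over pixels inside any box (assumed reading
    of "semi-dense"); returns 0.0 when no pixel is covered."""
    total, count = 0.0, 0
    for y, (pred_row, target_row) in enumerate(zip(pred, target)):
        for x, (p, t) in enumerate(zip(pred_row, target_row)):
            if any(x1 <= x < x2 and y1 <= y < y2 for x1, y1, x2, y2 in boxes):
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0


def class_weighted_detection_loss(
    per_object_losses: List[Tuple[str, float]],
    class_weights: Dict[str, float],
) -> float:
    """Mean of per-object 2D detection losses, each scaled by its class
    weight (assumed reading of "class-level loss adjustment")."""
    if not per_object_losses:
        return 0.0
    weighted = [class_weights.get(cls, 1.0) * loss for cls, loss in per_object_losses]
    return sum(weighted) / len(weighted)


def pretraining_loss(
    depth_loss: float,
    det_loss: float,
    depth_weight: float = 1.0,
    det_weight: float = 1.0,
) -> float:
    """Weighted sum of the two pre-training task losses."""
    return depth_weight * depth_loss + det_weight * det_loss


# Toy usage: a 2x2 depth map with one object covering the top-left pixel.
pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.5, 2.0], [3.0, 6.0]]
depth = semi_dense_depth_loss(pred, target, boxes=[(0, 0, 1, 1)])
det = class_weighted_detection_loss(
    [("Car", 1.0), ("Pedestrian", 2.0)], class_weights={"Pedestrian": 2.0}
)
total = pretraining_loss(depth, det)
```

In a real pipeline the per-object detection losses and the dense depth prediction would come from network heads attached to the backbone being pre-trained; the plain-Python lists here stand in for those tensors only to keep the sketch self-contained.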