Boosting 3D Object Detection by Simulating Multimodality on Point Clouds
This work addresses the performance gap in 3D object detection for autonomous driving systems by enabling LiDAR-only models to approach multi-modality accuracy without requiring images at inference, representing a strong incremental advance.
The paper tackles the problem of improving single-modality (LiDAR) 3D object detection by simulating features from multi-modality (LiDAR-image) detectors, achieving results that outperform state-of-the-art LiDAR-only detectors and fill 72% of the mAP gap between single- and multi-modality detectors on the nuScenes dataset.
This paper presents a new approach to boost a single-modality (LiDAR) 3D object detector by teaching it to simulate features and responses that follow a multi-modality (LiDAR-image) detector. The approach needs LiDAR-image data only when training the single-modality detector, and once well-trained, it only needs LiDAR data at inference. We design a novel framework to realize the approach: response distillation to focus on the crucial response samples and avoid the background samples; sparse-voxel distillation to learn voxel semantics and relations from the estimated crucial voxels; a fine-grained voxel-to-point distillation to better attend to features of small and distant objects; and instance distillation to further enhance the deep-feature consistency. Experimental results on the nuScenes dataset show that our approach outperforms all SOTA LiDAR-only 3D detectors and even surpasses the baseline LiDAR-image detector on the key NDS metric, filling 72% mAP gap between the single- and multi-modality detectors.