SDVRF: Sparse-to-Dense Voxel Region Fusion for Multi-modal 3D Object Detection
This work addresses perception challenges in autonomous driving by improving multi-modal fusion, although the contribution is incremental because it builds on existing methods.
The paper tackles two problems in multi-modal 3D object detection: the sparsity of LiDAR point clouds and the noise caused by LiDAR-camera misalignment. It introduces Voxel Regions and a Sparse-to-Dense Voxel Region Fusion method, and reports improved performance on the KITTI dataset, particularly for small-object classes such as Pedestrian and Cyclist.
In autonomous driving perception, multi-modal methods have become a trend because LiDAR point clouds and image data have complementary characteristics. However, the performance of multi-modal methods is usually limited by the sparsity of the point cloud or by the noise caused by misalignment between the LiDAR and the camera. To address these two problems, we present a new concept, the Voxel Region (VR), which is obtained by dynamically projecting the sparse local point cloud in each voxel. We also propose a novel fusion method named Sparse-to-Dense Voxel Region Fusion (SDVRF). Specifically, more pixels of the image feature map inside the VR are gathered to supplement the voxel feature extracted from sparse points and achieve denser fusion. Meanwhile, unlike prior methods, which project fixed-size grids, our strategy of generating dynamic regions achieves better alignment and avoids introducing too much background noise. Furthermore, we propose a multi-scale fusion framework to extract more contextual information and capture the features of objects of different sizes. Experiments on the KITTI dataset show that our method improves the performance of different baselines, especially on small-object classes such as Pedestrian and Cyclist.
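The Voxel Region idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the points of one voxel are already in the camera frame, that a 3x4 projection matrix is available, that the region is the axis-aligned bounding box of the projected points, that the image features inside the region are mean-pooled, and that fusion is a plain concatenation. The function names `voxel_region`, `gather_region_feature`, and `fuse` are hypothetical.

```python
import numpy as np


def voxel_region(points_xyz, proj):
    """Project the points of one voxel and return their 2D bounding box.

    points_xyz: (N, 3) points in the camera frame (assumption).
    proj: (3, 4) camera projection matrix (assumption).
    Returns (u_min, v_min, u_max, v_max) in pixel coordinates.
    """
    ones = np.ones((points_xyz.shape[0], 1))
    homo = np.hstack([points_xyz, ones])      # (N, 4) homogeneous points
    uvw = homo @ proj.T                       # (N, 3) projected points
    uv = uvw[:, :2] / uvw[:, 2:3]             # perspective divide
    # The region follows the points themselves, so its size changes
    # from voxel to voxel instead of being a fixed-size projected grid.
    u_min, v_min = np.floor(uv.min(axis=0)).astype(int)
    u_max, v_max = np.ceil(uv.max(axis=0)).astype(int)
    return u_min, v_min, u_max, v_max


def gather_region_feature(feature_map, region):
    """Mean-pool the (H, W, C) image features inside the region.

    Mean pooling is an assumption made for this sketch.
    """
    h, w, _ = feature_map.shape
    u_min, v_min, u_max, v_max = region
    # Clamp the region to the image bounds.
    u0, u1 = np.clip([u_min, u_max], 0, w - 1)
    v0, v1 = np.clip([v_min, v_max], 0, h - 1)
    patch = feature_map[v0:v1 + 1, u0:u1 + 1]
    return patch.reshape(-1, patch.shape[-1]).mean(axis=0)


def fuse(voxel_feature, image_feature):
    """Concatenate the two features (a placeholder for the fusion step)."""
    return np.concatenate([voxel_feature, image_feature])
```

Because the region is computed from the projected points rather than from the voxel grid, a voxel holding a few tightly clustered points yields a small region, which is one way to read the claim that dynamic regions introduce less background noise than fixed-size grids.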