PointSee: Image Enhances Point Cloud
This work addresses multi-modal fusion challenges for 3D object detection (3OD) in autonomous driving and robotics, but it appears incremental, as it adds new modules to existing 3OD networks.
The paper tackles three problems in multi-modal fusion for 3D object detection: heavyweight designs, poor plug-and-play flexibility, and inaccurate feature alignment. It proposes PointSee, a solution that enhances LiDAR point clouds with image features and reports numerical improvements over twenty-two state-of-the-art methods on outdoor and indoor benchmarks.
There is a trend to fuse multi-modal information for 3D object detection (3OD). However, the challenging problems of heavyweight designs, poor plug-and-play flexibility, and inaccurate feature alignment remain unsolved when designing multi-modal fusion networks. We propose PointSee, a lightweight, flexible, and effective multi-modal fusion solution that facilitates various 3OD networks through semantic feature enhancement of LiDAR point clouds paired with scene images. Beyond the existing wisdom of 3OD, PointSee consists of a hidden module (HM) and a seen module (SM). HM decorates LiDAR point clouds with 2D image information in an offline fusion manner, so existing 3OD networks need minimal or even no adaptation. SM further enriches the LiDAR point clouds by acquiring point-wise representative semantic features, which improves the performance of existing 3OD networks. Besides the new architecture of PointSee, we propose a simple yet efficient training strategy to mitigate potentially inaccurate regressions of 2D object detection networks. Extensive experiments on popular outdoor and indoor benchmarks show numerical improvements of PointSee over twenty-two state-of-the-art methods.
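To make the idea of decorating LiDAR points with 2D image information more concrete, the following is a minimal sketch of a generic point-decoration step. It is not the authors' HM implementation, whose details are not given here. It assumes points already expressed in the camera frame, a 3x4 projection matrix `P`, and 2D detections given as `(x1, y1, x2, y2, class_id, score)`; the function names `project` and `decorate_points` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed 2D detection format: x1, y1, x2, y2, class_id, score
Box = Tuple[float, float, float, float, int, float]


def project(point: Sequence[float], P: Sequence[Sequence[float]]) -> Optional[Tuple[float, float]]:
    """Project a 3D point (camera frame) to pixel coordinates using a 3x4 matrix P."""
    x, y, z = point
    h = (x, y, z, 1.0)  # homogeneous coordinates
    u, v, w = (sum(P[r][c] * h[c] for c in range(4)) for r in range(3))
    if w <= 0:  # point lies behind the image plane
        return None
    return u / w, v / w


def decorate_points(
    points: Sequence[Sequence[float]],
    P: Sequence[Sequence[float]],
    boxes: Sequence[Box],
    num_classes: int,
) -> List[List[float]]:
    """Append a per-class score vector to each point (illustrative only)."""
    decorated = []
    for p in points:
        scores = [0.0] * num_classes
        uv = project(p, P)
        if uv is not None:
            u, v = uv
            for x1, y1, x2, y2, cls, score in boxes:
                # Keep the highest score per class among boxes covering the pixel
                if x1 <= u <= x2 and y1 <= v <= y2:
                    scores[cls] = max(scores[cls], score)
        decorated.append(list(p) + scores)
    return decorated
```

Because this step runs before the detector sees the data, the decorated points can be stored offline and fed to a 3OD network whose only change is a wider input feature dimension, which is one plausible reading of "minimal or even no adaptations".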