VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
VoxelNet addresses the need for accurate 3D object detection in applications such as autonomous navigation, offering an end-to-end method that improves over existing LiDAR-based approaches.
The paper tackles the problem of 3D object detection in sparse LiDAR point clouds by proposing VoxelNet, an end-to-end deep network that eliminates manual feature engineering and unifies feature extraction and bounding box prediction. It achieves state-of-the-art results on the KITTI car detection benchmark and shows promising performance for pedestrians and cyclists.
Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need for manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms the group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms state-of-the-art LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on LiDAR alone.
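The voxel partitioning step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `voxelize` and `pooled_feature`, the voxel size, and the toy points are all hypothetical, and the fixed centroid-offset plus element-wise max aggregation is only a hand-written stand-in for the learned VFE layer, which the abstract does not specify in detail.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group (x, y, z) points by the index of the equally spaced voxel containing them.

    Only non-empty voxels are stored, which suits sparse LiDAR data.
    """
    voxels = defaultdict(list)
    for p in points:
        # Integer voxel index along each axis; floor also handles negative coordinates.
        idx = tuple(math.floor((p[i] - origin[i]) / voxel_size[i]) for i in range(3))
        voxels[idx].append(p)
    return dict(voxels)


def pooled_feature(group):
    """Illustrative stand-in for a learned per-voxel encoder (NOT the VFE layer itself).

    Each point is augmented with its offset from the voxel's centroid, then an
    element-wise max over the points yields one fixed-length vector per voxel.
    """
    n = len(group)
    centroid = [sum(p[i] for p in group) / n for i in range(3)]
    augmented = [list(p) + [p[i] - centroid[i] for i in range(3)] for p in group]
    return [max(row[j] for row in augmented) for j in range(6)]


# Toy point cloud (assumed values): two points share a voxel, one lies in a neighbor.
points = [(0.1, 0.2, 0.3), (0.3, 0.1, 0.2), (1.5, 0.2, 0.1)]
voxels = voxelize(points, voxel_size=(1.0, 1.0, 1.0))
features = {idx: pooled_feature(group) for idx, group in voxels.items()}
```

Here `voxels` maps each occupied voxel index to its points, and `features` holds one six-element vector per occupied voxel regardless of how many points fell inside it. In a network of the kind the abstract describes, the per-voxel aggregation would be learned rather than fixed, and the resulting volumetric representation would feed the RPN.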