P2-Net: Joint Description and Detection of Local Features for Pixel and Point Matching
This work addresses the challenge of directly matching image pixels to point-cloud points for applications such as visual localization, though the contribution is incremental because it builds on existing learning-based descriptors and detectors.
The paper tackles the problem of establishing fine-grained correspondences between 2D images and 3D point clouds by proposing a dual fully convolutional framework that jointly describes and detects keypoints in a shared latent space, achieving state-of-the-art results for indoor visual localization.
Accurately describing and detecting 2D and 3D keypoints is crucial to establishing correspondences across images and point clouds. Although many learning-based 2D or 3D local feature descriptors and detectors have been proposed, the derivation of a shared descriptor and joint keypoint detector that directly matches pixels and points remains under-explored by the community. This work takes the initiative to establish fine-grained correspondences between 2D images and 3D point clouds. To directly match pixels and points, a dual fully convolutional framework is presented that maps 2D and 3D inputs into a shared latent representation space to simultaneously describe and detect keypoints. Furthermore, an ultra-wide reception mechanism and a novel loss function are designed to mitigate the intrinsic information variations between pixel and point local regions. Extensive experimental results demonstrate that our framework shows competitive performance in fine-grained matching between images and point clouds and achieves state-of-the-art results for the task of indoor visual localization. Our source code will be available at [no-name-for-blind-review].
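To make the idea of a shared descriptor space concrete, the following minimal sketch shows how pixel and point descriptors that live in the same latent space could be matched. It is not the paper's method: the abstract does not specify a matching procedure, so mutual nearest-neighbor matching under cosine similarity is assumed here as a common baseline. The function names and the toy 3-dimensional descriptors are hypothetical and stand in for the outputs of the 2D and 3D branches.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(pixel_desc, point_desc):
    """Return sim[i][j], the cosine similarity of pixel i and point j."""
    pix = [l2_normalize(d) for d in pixel_desc]
    pts = [l2_normalize(d) for d in point_desc]
    return [[sum(a * b for a, b in zip(p, q)) for q in pts] for p in pix]


def mutual_nearest_neighbors(pixel_desc, point_desc):
    """Keep (pixel, point) pairs that are each other's best match.

    Assumption: matching rule chosen for illustration only.
    """
    sim = cosine_similarity_matrix(pixel_desc, point_desc)
    if not sim or not sim[0]:
        return []
    n_pixels, n_points = len(sim), len(sim[0])
    # Best point index for every pixel (row-wise argmax).
    best_point = [max(range(n_points), key=row.__getitem__) for row in sim]
    # Best pixel index for every point (column-wise argmax).
    best_pixel = [
        max(range(n_pixels), key=lambda i, j=j: sim[i][j])
        for j in range(n_points)
    ]
    return [(i, j) for i, j in enumerate(best_point) if best_pixel[j] == i]


# Hypothetical toy descriptors: 3 pixels and 2 points in a shared 3-D space.
pixel_descriptors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
point_descriptors = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0]]

matches = mutual_nearest_neighbors(pixel_descriptors, point_descriptors)
print(matches)  # [(0, 1), (1, 0)]
```

The third pixel is ambiguous between both points and is rejected by the mutual check, which illustrates why a jointly learned detector that selects distinctive keypoints is useful before matching.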