Deep Keypoint-Based Camera Pose Estimation with Geometric Constraints
This work addresses the challenge of making learning-based camera pose estimation as accurate and robust as conventional methods for applications in robotics and augmented reality. The contribution is incremental, since it builds on existing learning-based approaches.
The paper tackles relative camera pose estimation from consecutive frames, a core problem in visual odometry and SLAM. It designs an end-to-end trainable framework with learnable modules for detection, feature extraction, matching, and outlier rejection, and optimizes directly for the geometric pose objective. The resulting performance is on par with classic methods, and generalization to unseen datasets improves over existing learning-based methods.
Estimating relative camera poses from consecutive frames is a fundamental problem in visual odometry (VO) and simultaneous localization and mapping (SLAM), where classic methods consisting of hand-crafted features and sampling-based outlier rejection have been a dominant choice for over a decade. Although multiple works propose to replace these modules with learning-based counterparts, most have not yet been as accurate, robust and generalizable as conventional methods. In this paper, we design an end-to-end trainable framework consisting of learnable modules for detection, feature extraction, matching and outlier rejection, while directly optimizing for the geometric pose objective. We show both quantitatively and qualitatively that pose estimation performance on par with the classic pipeline can be achieved. Moreover, we show that end-to-end training significantly improves the key components of the pipeline, which leads to better generalizability to unseen datasets compared to existing learning-based methods.
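To make the phrase "geometric pose objective" concrete, the sketch below illustrates the standard epipolar constraint that links a relative pose to keypoint correspondences. It is not the paper's loss, which this text does not specify. It assumes the convention X2 = R X1 + t, normalized (calibrated) image coordinates, and the essential matrix E = [t]x R. All function names are hypothetical and chosen for illustration only.

```python
import math

def skew(t):
    """Return the 3x3 cross-product matrix [t]x of a 3-vector t."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_from_pose(R, t):
    """Essential matrix E = [t]x R, assuming X2 = R X1 + t."""
    return matmul(skew(t), R)

def epipolar_residual(E, x1, x2):
    """Algebraic epipolar error x2^T E x1 for homogeneous normalized points."""
    Ex1 = matvec(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))

def rot_y(angle):
    """Rotation about the y axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def project(X):
    """Project a 3D point to homogeneous normalized image coordinates."""
    return [X[0] / X[2], X[1] / X[2], 1.0]

# Synthetic example: an assumed relative pose and a few 3D points in frame 1.
R = rot_y(0.1)
t = [0.5, 0.0, 0.1]
E = essential_from_pose(R, t)
points = [[0.3, -0.2, 4.0], [-1.0, 0.5, 6.0], [0.8, 0.9, 5.0]]

# Correct correspondences: residuals vanish up to floating-point error.
residuals = []
for X1 in points:
    X2 = [a + b for a, b in zip(matvec(R, X1), t)]
    residuals.append(epipolar_residual(E, project(X1), project(X2)))

# A deliberately wrong match (second keypoint shifted in y) violates the constraint.
X2_first = [a + b for a, b in zip(matvec(R, points[0]), t)]
x2_bad = project(X2_first)
x2_bad[1] += 0.1
outlier_residual = epipolar_residual(E, project(points[0]), x2_bad)
```

Correct correspondences give residuals near zero, while the shifted match does not. This is the signal that outlier rejection, whether sampling-based or learned, can exploit. In an end-to-end setting, a differentiable function of the estimated pose or of such residuals could serve as the training objective.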