RelMobNet: End-to-end relative camera pose estimation using a robust two-stage training
This work addresses a key problem in augmented reality and robotics by enhancing pose estimation accuracy, though it is incremental as it builds on prior CNN-based methods.
The paper tackles relative camera pose estimation by proposing an end-to-end network with a novel two-stage training method that improves generalization without a hyperparameter to balance the translation and rotation losses, improving translation vector estimation by up to 52.27% on specific scenes compared to existing methods.
Relative camera pose estimation, i.e. estimating the translation and rotation vectors from a pair of images taken at different locations, is an important part of systems in augmented reality and robotics. In this paper, we present an end-to-end relative camera pose estimation network using a siamese architecture that is independent of camera parameters. The network is trained on the Cambridge Landmarks data, using four individual scene datasets and a dataset combining the four scenes. To improve generalization, we propose a novel two-stage training that removes the need for a hyperparameter to balance the scales of the translation and rotation losses. We compare the proposed method with one-stage training CNN-based methods such as RPNet and RCPNet and demonstrate that the proposed model improves translation vector estimation by 16.11%, 28.88%, and 52.27% on the Kings College, Old Hospital, and St Marys Church scenes, respectively. To assess texture invariance, we investigate, as ablation studies, how well the proposed method generalizes when the datasets are augmented to different scene styles using generative adversarial networks. We also present a qualitative assessment of epipolar lines computed from our network predictions and from ground truth poses.
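The qualitative epipolar-line assessment mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes the relative pose is available as a rotation matrix `R` and a translation vector `t` under the convention X2 = R X1 + t, and that both images share a known intrinsic matrix `K`. The intrinsics are needed only to draw the lines in pixel coordinates; the network itself is described as independent of camera parameters. The function names are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])

def essential_from_pose(R, t):
    """Essential matrix E = [t]_x R, assuming the convention X2 = R @ X1 + t."""
    return skew(t) @ R

def epipolar_line(E, K, pixel1):
    """Line (a, b, c) in image 2, with a*u + b*v + c = 0, for a pixel in image 1.

    Assumes both images share the same intrinsic matrix K (an assumption of
    this sketch, not something stated in the abstract).
    """
    K_inv = np.linalg.inv(K)
    F = K_inv.T @ E @ K_inv  # fundamental matrix
    return F @ np.array([pixel1[0], pixel1[1], 1.0])

def point_line_distance(line, pixel2):
    """Perpendicular distance in pixels from a pixel in image 2 to the line."""
    a, b, c = line
    return abs(a * pixel2[0] + b * pixel2[1] + c) / np.hypot(a, b)
```

With a ground truth pose, a correct correspondence lies on its epipolar line, so `point_line_distance` is close to zero. Computing the same line from a predicted pose and comparing the two gives a visual sense of the pose error. Because the essential matrix is defined only up to scale, the lines depend on the direction of `t`, not its magnitude.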