Self-Supervised Feature Learning for Long-Term Metric Visual Localization
This work addresses robust camera pose estimation in changing environments, a core problem in robotics and computer vision. It presents an incremental improvement over existing learned-feature methods by removing the need for ground-truth supervision.
The paper tackles long-term visual localization under environmental change by proposing a self-supervised feature learning framework that generates image correspondences without ground-truth pose labels. The learned features are validated in 22.4 km of closed-loop experiments under varying lighting conditions.
Visual localization is the task of estimating camera pose in a known scene, which is an essential problem in robotics and computer vision. However, long-term visual localization is still a challenge due to changes in environmental appearance caused by lighting and seasons. While techniques exist to address appearance changes using neural networks, these methods typically require ground-truth pose information, either to generate accurate image correspondences or to act as a supervisory signal during training. In this paper, we present a novel self-supervised feature learning framework for metric visual localization. We use a sequence-based image matching algorithm across different sequences of images (i.e., experiences) to generate image correspondences without ground-truth labels. We can then sample image pairs to train a deep neural network that learns sparse features with associated descriptors and scores without ground-truth pose supervision. The learned features can be used together with a classical pose estimator for visual stereo localization. We validate the learned features by integrating them into an existing Visual Teach & Repeat pipeline and performing closed-loop localization experiments under different lighting conditions for a total of 22.4 km.
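
To make the correspondence-generation step more concrete, the following is a minimal sketch of a simplified sequence-matching scheme. It is not the algorithm used in this paper, whose details are not given above. It assumes each image in an experience has already been reduced to a global descriptor vector (a hypothetical input), and it aligns short windows of consecutive frames at a fixed one-to-one frame rate. The function name `sequence_match` and the parameter `window` are illustrative only.

```python
import numpy as np


def sequence_match(desc_a, desc_b, window=5):
    """Match windows of frames in experience A to windows in experience B.

    desc_a: (N, D) array of per-image descriptors (hypothetical input).
    desc_b: (M, D) array of per-image descriptors.
    Returns a list of (i, j) pairs: the window starting at frame i of A
    is best aligned with the window starting at frame j of B.
    """
    n, m = len(desc_a), len(desc_b)
    if n < window or m < window:
        raise ValueError("both experiences must contain at least `window` frames")

    # Cosine distance between every frame of A and every frame of B.
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    dist = 1.0 - a @ b.T  # shape (N, M)

    offsets = np.arange(window)
    pairs = []
    for i in range(n - window + 1):
        # Summed distance along a diagonal: A[i + k] against B[j + k].
        costs = np.array([
            dist[i + offsets, j + offsets].sum()
            for j in range(m - window + 1)
        ])
        pairs.append((i, int(np.argmin(costs))))
    return pairs
```

Scoring a whole window rather than a single frame is what makes matching of this kind tolerant to appearance change: one ambiguous frame is outvoted by its neighbours. The resulting index pairs could then be sampled as training image pairs without any pose labels, which is the role the matching step plays in the framework described above.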