C-BEV: Contrastive Bird's Eye View Training for Cross-View Image Retrieval and 3-DoF Pose Estimation
This work addresses the accurate geolocation of street-view images in real-world settings with many-to-one ambiguities, where several street-view images taken at different camera poses must be matched to the same aerial image. The problem is relevant to applications such as autonomous navigation and mapping.
The paper tackles cross-view geolocalization in real-world scenarios with varying camera poses by proposing a trainable retrieval architecture that uses bird's eye view (BEV) maps as embeddings. The approach improves retrieval accuracy and learns to infer the 3-DoF camera pose without explicit pose supervision. It raises top-1 recall from 31.1% to 65.0% on VIGOR's cross-area split with unknown orientation and yields a lower mean pose error than recent methods trained with metric ground truth.
To find the geolocation of a street-view image, cross-view geolocalization (CVGL) methods typically perform image retrieval on a database of georeferenced aerial images and determine the location from the visually most similar match. Recent approaches focus mainly on settings where street-view and aerial images are preselected to align w.r.t. translation or orientation, but struggle in challenging real-world scenarios where varying camera poses have to be matched to the same aerial image. We propose a novel trainable retrieval architecture that uses bird's eye view (BEV) maps rather than vectors as the embedding representation, and explicitly addresses the many-to-one ambiguity that arises in real-world scenarios. The BEV-based retrieval is trained using the same contrastive setting and loss as classical retrieval. Our method C-BEV surpasses the state-of-the-art on the retrieval task on multiple datasets by a large margin. It is particularly effective in challenging many-to-one scenarios, e.g. increasing the top-1 recall on VIGOR's cross-area split with unknown orientation from 31.1% to 65.0%. Although the model is supervised only through a contrastive objective applied to image pairings, it additionally learns to infer the 3-DoF camera pose on the matching aerial image, and even yields a lower mean pose error than recent methods that are explicitly trained with metric ground truth.
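To illustrate how a map-based embedding can yield a pose as a by-product of retrieval, the following is a minimal NumPy sketch and not the C-BEV architecture itself. It assumes that a street-view BEV map and an aerial BEV map are compared by cosine similarity maximized over candidate poses, that the pose search is an exhaustive loop over 90° rotations and small circular shifts, and that the contrastive loss is a street-to-aerial InfoNCE loss. The names `pair_score`, `similarity_matrix`, `info_nce`, and `top1_recall`, and the parameters `max_shift` and `temperature`, are hypothetical and chosen only for this example.

```python
import numpy as np

def pair_score(street_bev, aerial_bev, max_shift=2):
    """Best cosine similarity between two square (H, W, C) BEV maps over a
    small pose grid (assumption: 90-degree rotations and circular shifts).
    Returns (score, (k, dy, dx)) where k counts 90-degree rotations."""
    norm = np.linalg.norm(street_bev) * np.linalg.norm(aerial_bev)
    best_score, best_pose = -np.inf, None
    for k in range(4):
        rotated = np.rot90(street_bev, k)  # rotates the first two axes
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                candidate = np.roll(rotated, (dy, dx), axis=(0, 1))
                score = float((candidate * aerial_bev).sum() / norm)
                if score > best_score:
                    best_score, best_pose = score, (k, dy, dx)
    return best_score, best_pose

def similarity_matrix(street_maps, aerial_maps, max_shift=2):
    """sim[i, j] is the pose-maximized score of street map i and aerial map j."""
    return np.array([[pair_score(s, a, max_shift)[0] for a in aerial_maps]
                     for s in street_maps])

def info_nce(sim, labels, temperature=0.1):
    """Street-to-aerial InfoNCE loss; labels[i] is the index of the aerial
    image matching street image i (several i may share one index)."""
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())

def top1_recall(sim, labels):
    """Fraction of street images whose highest-scoring aerial image is correct."""
    return float((sim.argmax(axis=1) == labels).mean())
```

Because the score is a maximum over poses, the pose that attains it comes for free at retrieval time, which matches the abstract's observation that pose can emerge without pose labels. The label vector also allows several street-view maps to point to the same aerial map, which is the many-to-one case. A learned model would produce the BEV maps with neural encoders and would search a finer pose grid; both are outside the scope of this sketch.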