CV · Dec 13, 2023

C-BEV: Contrastive Bird's Eye View Training for Cross-View Image Retrieval and 3-DoF Pose Estimation

arXiv:2312.08060v1 · 13 citations · h-index: 14
AI Analysis

This work addresses the challenge of geolocating street-view images in real-world settings where many street-view images taken with different camera poses must be matched to the same aerial image (a many-to-one ambiguity). The gain is specific to this setting and is relevant to applications such as autonomous navigation and mapping.

The paper tackles cross-view geolocalization in real-world scenarios with varying camera poses. It proposes a trainable retrieval architecture that uses bird's eye view (BEV) maps instead of vectors as the embedding representation, which improves retrieval accuracy and lets the model infer the 3-DoF camera pose without explicit pose supervision. On VIGOR's cross-area split with unknown orientation, top-1 recall rises from 31.1% to 65.0%, and the mean pose error is lower than that of recent methods trained explicitly with metric ground truth.

To find the geolocation of a street-view image, cross-view geolocalization (CVGL) methods typically perform image retrieval on a database of georeferenced aerial images and determine the location from the visually most similar match. Recent approaches focus mainly on settings where street-view and aerial images are preselected to align w.r.t. translation or orientation, but struggle in challenging real-world scenarios where varying camera poses have to be matched to the same aerial image. We propose a novel trainable retrieval architecture that uses bird's eye view (BEV) maps rather than vectors as embedding representation, and explicitly addresses the many-to-one ambiguity that arises in real-world scenarios. The BEV-based retrieval is trained using the same contrastive setting and loss as classical retrieval. Our method C-BEV surpasses the state-of-the-art on the retrieval task on multiple datasets by a large margin. It is particularly effective in challenging many-to-one scenarios, e.g. increasing the top-1 recall on VIGOR's cross-area split with unknown orientation from 31.1% to 65.0%. Although the model is supervised only through a contrastive objective applied on image pairings, it additionally learns to infer the 3-DoF camera pose on the matching aerial image, and even yields a lower mean pose error than recent methods that are explicitly trained with metric groundtruth.
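To make the "classical retrieval" setting mentioned in the abstract concrete, the following is a minimal sketch of contrastive training and top-1 recall evaluation on plain vector embeddings. It is not the authors' implementation: the abstract does not name the loss, so a symmetric InfoNCE loss is assumed here as one common choice, and the names `info_nce`, `recall_at_1`, and the `temperature` value are hypothetical. C-BEV replaces the vectors with BEV maps, which this sketch does not attempt to model.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(street, aerial, temperature=0.1):
    """Symmetric InfoNCE loss (assumed loss, not confirmed by the abstract).

    street[i] and aerial[i] are embeddings of a matching image pair;
    all other rows in the batch act as negatives.
    """
    logits = l2_normalize(street) @ l2_normalize(aerial).T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the street-to-aerial and aerial-to-street directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def recall_at_1(queries, database, gt_index):
    """Fraction of queries whose most similar database entry is the correct one.

    gt_index may repeat values: several street-view queries can share one
    aerial image, which is the many-to-one case described in the abstract.
    """
    sims = l2_normalize(queries) @ l2_normalize(database).T
    return float(np.mean(sims.argmax(axis=1) == np.asarray(gt_index)))


if __name__ == "__main__":
    # Toy data: three aerial embeddings, four street-view queries,
    # where the first two queries both belong to aerial image 0.
    database = np.eye(3)
    queries = np.array([[1.0, 0.1, 0.0],
                        [0.9, 0.0, 0.1],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
    print("recall@1:", recall_at_1(queries, database, [0, 0, 1, 2]))
    print("loss (aligned pairs):", info_nce(np.eye(3), np.eye(3)))
```

In this vector setting, a single embedding per aerial image has to be close to every street-view pose that falls inside it. The abstract's argument is that a BEV map as the embedding handles this many-to-one case better, and that the pose falls out of the matching as a by-product.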

