CV | Jan 13, 2025

VAGeo: View-specific Attention for Cross-View Object Geo-Localization

arXiv:2501.07194v1 | 10 citations | h-index: 5 | ICASSP

Originality: Incremental advance

AI Analysis

This work addresses the problem of locating an object, selected in a ground- or drone-view image, within a satellite image, for applications such as surveillance or mapping. It is an incremental advance: it builds on existing CVOGL methods by adding viewpoint-specific adaptations.

The paper tackles cross-view object geo-localization by addressing the viewpoint discrepancy between ground-view and drone-view query images. It proposes VAGeo, which combines a view-specific positional encoding module with a channel-spatial hybrid attention module. On the CVOGL dataset, acc@0.25/acc@0.5 improves from 45.43%/42.24% to 48.21%/45.22% for ground-view queries and from 61.97%/57.66% to 66.19%/61.87% for drone-view queries.
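The paired percentages are acc@0.25/acc@0.5 scores. A minimal sketch of how such a score could be computed follows, assuming the common convention that a prediction counts as correct when the intersection-over-union (IoU) between the predicted and ground-truth boxes reaches the threshold. The `(x1, y1, x2, y2)` box format, the `>=` comparison, and the function names `iou` and `acc_at` are assumptions for illustration, not taken from the paper.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2). Box format is an assumption."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds, gts, thr):
    """Fraction of predictions whose IoU with the ground truth is at least thr (assumed rule)."""
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= thr)
    return hits / len(gts)


# Toy data: one exact match, one half-shifted box, one complete miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10), (20, 20, 30, 30)]
acc_025 = acc_at(preds, gts, 0.25)
acc_050 = acc_at(preds, gts, 0.5)
```

In the toy data the half-shifted box has an IoU of 1/3, so it counts at the 0.25 threshold but not at 0.5, which is why the second number in each reported pair is the lower one.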

Cross-view object geo-localization (CVOGL) aims to locate an object of interest, captured in a ground- or drone-view image, within a satellite image. However, existing works treat ground-view and drone-view query images equivalently, overlooking their inherent viewpoint discrepancies and the spatial correlation between the query image and the satellite-view reference image. To this end, this paper proposes a novel View-specific Attention Geo-localization method (VAGeo) for accurate CVOGL. Specifically, VAGeo contains two key modules: a view-specific positional encoding (VSPE) module and a channel-spatial hybrid attention (CSHA) module. At the object level, the VSPE module uses positional encodings designed for the viewpoint characteristics of ground and drone query images to identify the click-point object in the query image more accurately. At the feature level, the CSHA module introduces a hybrid attention that combines channel attention and spatial attention to learn discriminative features. Extensive experimental results demonstrate that the proposed VAGeo gains a significant performance improvement, raising acc@0.25/acc@0.5 on the CVOGL dataset from 45.43%/42.24% to 48.21%/45.22% for ground-view, and from 61.97%/57.66% to 66.19%/61.87% for drone-view.
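To make the idea of combining channel and spatial attention concrete, here is a deliberately simplified, parameter-free sketch. It is not the CSHA module from the paper: the abstract does not give the module's architecture, so the gating functions (a sigmoid over pooled averages), the absence of learned weights, and the name `hybrid_gate` are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hybrid_gate(feat):
    """Apply a channel gate and a spatial gate to a feature map.

    feat is a nested list shaped [C][H][W]. This is a hypothetical,
    weight-free stand-in for learned attention, not the paper's CSHA.
    """
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    # Channel gate: one weight per channel, from its global average.
    ch = [sigmoid(sum(sum(row) for row in feat[c]) / (H * W)) for c in range(C)]
    # Spatial gate: one weight per location, from the mean across channels.
    sp = [[sigmoid(sum(feat[c][i][j] for c in range(C)) / C) for j in range(W)]
          for i in range(H)]
    # Both gates are computed from the same input and applied together.
    return [[[feat[c][i][j] * ch[c] * sp[i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]


# Toy input: 2 channels of 2x2, with channel 1 stronger than channel 0.
feat = [[[1.0, 1.0], [1.0, 1.0]],
        [[3.0, 3.0], [3.0, 3.0]]]
out = hybrid_gate(feat)
```

A learned version would typically replace the plain averages with small trainable layers (for example an MLP for the channel gate and a convolution for the spatial gate), but the structure of reweighting features along both the channel axis and the spatial axes stays the same.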
