Spatially-Weighted CLIP for Street-View Geo-localization
The work addresses street-view geo-localization for applications such as mapping and navigation by shifting the training objective from semantic alignment to geographic alignment. It is an incremental advance: a new method applied to a known bottleneck.
The paper tackles street-view geo-localization by proposing SW-CLIP, which incorporates spatial autocorrelation into vision-language contrastive learning. Compared with standard CLIP, the authors report higher accuracy, fewer long-tail errors, and better spatial coherence.
This paper proposes Spatially-Weighted CLIP (SW-CLIP), a novel framework for street-view geo-localization that explicitly incorporates spatial autocorrelation into vision-language contrastive learning. Unlike conventional CLIP-based methods that treat all non-matching samples as equally negative, SW-CLIP leverages Tobler's First Law of Geography to model geographic relationships through distance-aware soft supervision. Specifically, we introduce a location-as-text representation to encode geographic positions and replace one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. Additionally, a neighborhood-consistency regularization is employed to preserve local spatial structure in the embedding space. Experiments on a multi-city dataset demonstrate that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP. The results highlight the importance of shifting from semantic alignment to geographic alignment for robust geo-localization and provide a general paradigm for integrating spatial principles into multimodal representation learning.
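To make the idea of distance-aware soft supervision concrete, the following is a minimal sketch of how one-hot InfoNCE targets could be replaced with spatially weighted soft labels. It is not the paper's implementation: the Gaussian kernel, the bandwidth parameter `sigma_km`, and the function names `haversine_km`, `soft_targets`, and `soft_info_nce` are all assumptions chosen for illustration, and the abstract does not specify the exact weighting function.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def soft_targets(coords, sigma_km=1.0):
    """Row-normalized weights exp(-d_ij^2 / (2 * sigma^2)).

    The Gaussian kernel is an assumed choice; any weight that decays with
    geodesic distance would follow the same pattern.
    """
    rows = []
    for p in coords:
        w = [math.exp(-haversine_km(p, q) ** 2 / (2 * sigma_km ** 2))
             for q in coords]
        total = sum(w)  # always >= 1 because the self-distance is 0
        rows.append([x / total for x in w])
    return rows


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def soft_info_nce(logits, targets):
    """Mean cross-entropy between soft targets and the softmax of each row.

    logits[i][j] is the image-i / location-j similarity (already scaled by
    temperature). With one-hot targets this reduces to standard InfoNCE.
    """
    loss = 0.0
    for row, t in zip(logits, targets):
        ls = log_softmax(row)
        loss -= sum(ti * li for ti, li in zip(t, ls))
    return loss / len(logits)


# Two nearby points in Paris and one distant point in New York.
coords = [(48.8584, 2.2945), (48.8606, 2.3376), (40.6892, -74.0445)]
targets = soft_targets(coords, sigma_km=5.0)
logits = [[5.0, 3.0, 0.0], [3.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
loss = soft_info_nce(logits, targets)
```

In this sketch the two Paris samples share target mass with each other, so the model is not penalized as heavily for confusing them, while the New York sample stays an effectively hard negative. Shrinking `sigma_km` toward zero recovers one-hot targets and therefore the standard contrastive loss.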