CV · AI · Sep 1, 2025

Street-Level Geolocalization Using Multimodal Large Language Models and Retrieval-Augmented Generation

arXiv:2509.01341v1 · h-index: 15
Originality: Highly original
AI Analysis

The approach offers a more accessible and scalable solution for navigation, location-based services, and urban planning by addressing challenges that traditional computer vision methods face.

The paper tackles street-level geolocalization from images by integrating multimodal large language models with retrieval-augmented generation, achieving state-of-the-art accuracy on benchmark datasets like IM2GPS, IM2GPS3k, and YFCC4k without requiring fine-tuning or retraining.

Street-level geolocalization from images is crucial for a wide range of essential applications and services, such as navigation, location-based recommendations, and urban planning. With the growing popularity of social media data and cameras embedded in smartphones, applying traditional computer vision techniques to localize images has become increasingly challenging, yet highly valuable. This paper introduces a novel approach that integrates open-weight and publicly accessible multimodal large language models with retrieval-augmented generation. The method constructs a vector database using the SigLIP encoder on two large-scale datasets (EMP-16 and OSV-5M). Query images are augmented with prompts containing both similar and dissimilar geolocation information retrieved from this database before being processed by the multimodal large language models. Our approach demonstrates state-of-the-art performance, achieving higher accuracy than existing methods on three widely used benchmark datasets (IM2GPS, IM2GPS3k, and YFCC4k). Importantly, our solution eliminates the need for expensive fine-tuning or retraining and scales seamlessly to incorporate new data sources. The effectiveness of retrieval-augmented multimodal large language models in geolocation estimation, as demonstrated in this paper, suggests an alternative to traditional methods that rely on training models from scratch, opening new possibilities for more accessible and scalable solutions in GeoAI.
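The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors stand in for SigLIP image embeddings, "dissimilar" is assumed to mean lowest cosine similarity, and the function names (`retrieve`, `build_prompt`) and the prompt wording are hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query_emb, db_embs, db_coords, k=2):
    """Return coordinates of the k most and k least similar database entries.

    Assumption: 'dissimilar' means lowest cosine similarity to the query.
    Use k <= len(db_coords) // 2 so the two lists do not overlap.
    """
    sims = normalize(db_embs) @ normalize(query_emb)
    order = np.argsort(sims)  # ascending: least similar first
    similar = [db_coords[i] for i in order[::-1][:k]]
    dissimilar = [db_coords[i] for i in order[:k]]
    return similar, dissimilar

def build_prompt(similar, dissimilar):
    """Compose a text prompt for the multimodal model (wording is hypothetical)."""
    def fmt(coords):
        return "; ".join(f"({lat:.4f}, {lon:.4f})" for lat, lon in coords)
    return (
        "Estimate the latitude and longitude of the attached image.\n"
        f"Visually similar reference images were taken at: {fmt(similar)}\n"
        f"Visually dissimilar reference images were taken at: {fmt(dissimilar)}\n"
        "Answer with a single (latitude, longitude) pair."
    )

if __name__ == "__main__":
    # Toy 2-D embeddings; a real database would hold SigLIP image embeddings.
    db_embs = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]]
    db_coords = [(48.8584, 2.2945), (48.8606, 2.3376),
                 (-33.8568, 151.2153), (40.6892, -74.0445)]
    sim, dis = retrieve([1.0, 0.0], db_embs, db_coords, k=2)
    print(build_prompt(sim, dis))
```

Because the prompt is assembled at query time from whatever the database returns, adding a new data source in this sketch only means appending embeddings and coordinates to the database. The language model itself is untouched, which is consistent with the abstract's claim that no fine-tuning or retraining is needed.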

