CV Mar 11, 2025

STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision

arXiv:2503.07939v1 3.6 h-index: 39

Originality: Incremental advance

AI Analysis

This work addresses precise localization for robotics and autonomous vehicles, offering a computationally efficient alternative to GPS and retrieval-based methods, though it is incremental as it builds on existing generative models.

The paper tackles vision-based localization by introducing sequential generative models that transform first-person perspective observations into global map representations and precise coordinates, achieving median deviations as low as 2.29m and outperforming prior methods with an AUC of 0.777.

This paper explores vision-based localization through a biologically-inspired approach that mirrors how humans and animals link views or perspectives when navigating their world. We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective (FPP) observations into global map perspective (GMP) representations and precise geographical coordinates. Unlike retrieval-based methods, our approach frames localization as a generative task, learning direct mappings between perspectives without relying on dense satellite image databases. We evaluate these models across two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan. The VAE-Transformer achieves impressive precision, with median deviations of 2.29m (1.37% of environment size) and 4.45m (0.35% of environment size) respectively, outperforming both VAE-RNN and prior cross-view geo-localization approaches. Our comprehensive Localization Performance Characteristics (LPC) analysis demonstrates superior performance with the VAE-Transformer achieving an AUC of 0.777 compared to 0.295 for VIGOR 200 and 0.225 for TransGeo, establishing a new state-of-the-art in vision-based localization. In some scenarios, our vision-based system rivals commercial smartphone GPS accuracy (AUC of 0.797) while requiring 5x less GPU memory and delivering 3x faster inference than existing methods in cross-view geo-localization. These results demonstrate that models inspired by biological spatial navigation can effectively memorize complex, dynamic environments and provide precise localization with minimal computational resources.
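The deviation figures above can be put in context with a little arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: `median_deviation` and `implied_environment_size` are hypothetical helper names, and it assumes that "environment size" is a single length in meters and that the reported percentage is simply the median deviation divided by that length. The abstract does not say how environment size is measured.

```python
import numpy as np

def median_deviation(pred_xy, true_xy):
    """Median Euclidean distance (meters) between predicted and true positions.

    Both inputs are (N, 2) arrays of planar coordinates in meters (assumed).
    """
    pred_xy = np.asarray(pred_xy, dtype=float)
    true_xy = np.asarray(true_xy, dtype=float)
    return float(np.median(np.linalg.norm(pred_xy - true_xy, axis=1)))

def implied_environment_size(deviation_m, percent_of_env):
    """Back out the environment size implied by a deviation and its percentage.

    Assumes percent_of_env = 100 * deviation_m / environment_size.
    """
    return deviation_m / (percent_of_env / 100.0)

# Figures quoted in the abstract.
campus_size = implied_environment_size(2.29, 1.37)    # roughly 167 m
downtown_size = implied_environment_size(4.45, 0.35)  # roughly 1271 m

# Toy check of the median on made-up points (not data from the paper).
toy_median = median_deviation([[0, 0], [3, 4], [0, 1]], [[0, 0], [0, 0], [0, 0]])
```

Under these assumptions the campus environment would be on the order of 170 m across and the downtown area on the order of 1.3 km, which is why the larger absolute error downtown corresponds to a smaller relative error.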

