Textual Supervision Enhances Geospatial Representations in Vision-Language Models
For researchers in geospatial AI and multimodal learning, this work identifies systematic gaps in spatial accuracy and shows that textual supervision improves geospatial representations. Its contribution is an observational analysis rather than a new method.
The paper analyzes geospatial representations in vision-only, vision-language, and multimodal foundation models, and finds that textual supervision improves spatial accuracy across image clusters of varying localizability. The results suggest that language is an effective complementary signal for geospatial learning and that multimodal approaches are a key direction for geospatial AI.
Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped by degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings suggest that language serves as an effective complementary modality for encoding spatial context, and they point to multimodal learning as a key direction for advancing geospatial AI.