CV · Dec 19, 2025

MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

arXiv:2512.17492v1 · h-index: 14
Originality: Synthesis-oriented
AI Analysis

The paper addresses the need for multimodal datasets in geo-spatial analysis. The contribution is incremental: it builds on existing benchmarks by adding more modalities and data.

The authors address the limited multimodal coverage of geo-spatial benchmarks by introducing the MMLANDMARKS dataset, which pairs 197k aerial images, 329k ground-view images, text, and coordinates for 18,557 landmarks. The dataset supports tasks such as cross-view retrieval and geolocalization, and a simple CLIP-inspired baseline trained on it performs competitively with state-of-the-art models.

Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLANDMARKS), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLANDMARKS dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding.
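
To make the "CLIP-inspired baseline" and the cross-view retrieval task more concrete, here is a minimal sketch of a symmetric contrastive loss between two modalities and a Recall@k retrieval metric. It is not the authors' implementation. The function names, the toy embeddings, and the temperature of 0.07 are illustrative assumptions, and real embeddings would come from trained encoders rather than random vectors. The sketch relies only on the one-to-one correspondence described in the abstract: row i of one modality matches row i of the other.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def similarity_matrix(queries, gallery):
    """Cosine similarity between every query and every gallery embedding."""
    q = [normalize(v) for v in queries]
    g = [normalize(v) for v in gallery]
    return [[sum(a * b for a, b in zip(qi, gj)) for gj in g] for qi in q]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """CLIP-style loss: average of the A-to-B and B-to-A cross-entropies.

    The temperature value is an assumption made for this sketch.
    """
    sims = similarity_matrix(emb_a, emb_b)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for B-to-A
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def recall_at_k(queries, gallery, k=1):
    """Fraction of queries whose true match (same index) is in the top k."""
    sims = similarity_matrix(queries, gallery)
    hits = 0
    for i, row in enumerate(sims):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sims)


# Toy data standing in for ground-view and aerial embeddings (hypothetical).
random.seed(0)
ground = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aerial_aligned = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in ground]
aerial_random = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

loss_aligned = symmetric_contrastive_loss(ground, aerial_aligned)
loss_random = symmetric_contrastive_loss(ground, aerial_random)
r1_aligned = recall_at_k(ground, aerial_aligned, k=1)
```

In this toy setup the loss is lower when the two modalities are aligned than when they are unrelated, and Recall@1 is perfect for the aligned pair. The same two functions would apply in principle to any pair of modalities with index-aligned embeddings, such as text and image or text and GPS.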

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
