CV · AI · LG · May 18, 2025

MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark

arXiv:2505.12254v1 · 4 citations · h-index: 1 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the underrepresentation of non-Western urban contexts in visual place recognition (VPR) datasets, a gap that matters to researchers in computer vision and geospatial understanding. The contribution is incremental, since it builds on existing VPR datasets.

The authors address the lack of multimodal and street-level diversity in visual place recognition datasets by introducing MMS-VPR, a large-scale dataset with 78,575 images and 2,512 video clips from 207 locations in Chengdu, China. In their benchmarks, models that use multimodal and structural cues show substantial improvements.

Existing visual place recognition (VPR) datasets predominantly rely on vehicle-mounted imagery, lack multimodal diversity and underrepresent dense, mixed-use street-level spaces, especially in non-Western urban contexts. To address these gaps, we introduce MMS-VPR, a large-scale multimodal dataset for street-level place recognition in complex, pedestrian-only environments. The dataset comprises 78,575 annotated images and 2,512 video clips captured across 207 locations in a ~70,800 $\mathrm{m}^2$ open-air commercial district in Chengdu, China. Each image is labeled with precise GPS coordinates, timestamp, and textual metadata, and covers varied lighting conditions, viewpoints, and timeframes. MMS-VPR follows a systematic and replicable data collection protocol with minimal device requirements, lowering the barrier for scalable dataset creation. Importantly, the dataset forms an inherent spatial graph with 125 edges, 81 nodes, and 1 subgraph, enabling structure-aware place recognition. We further define two application-specific subsets -- Dataset_Edges and Dataset_Points -- to support fine-grained and graph-based evaluation tasks. Extensive benchmarks using conventional VPR models, graph neural networks, and multimodal baselines show substantial improvements when leveraging multimodal and structural cues. MMS-VPR facilitates future research at the intersection of computer vision, geospatial understanding, and multimodal reasoning. The dataset is publicly available at https://huggingface.co/datasets/Yiwei-Ou/MMS-VPR.
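The abstract says the locations form a spatial graph with a single subgraph but does not describe how that graph is stored. As a minimal sketch only, the idea can be illustrated with an undirected adjacency map and a breadth-first check that every node is reachable from every other. The node identifiers (`N1` to `N4`), the toy layout, and the function names `build_adjacency` and `connected_components` are hypothetical and are not taken from the dataset or its code.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (node_a, node_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def connected_components(nodes, adjacency):
    """Return a list of node sets, one per connected component (BFS)."""
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        component = {start}
        while queue:
            current = queue.popleft()
            # .get avoids creating empty entries for isolated nodes
            for neighbour in adjacency.get(current, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Hypothetical toy layout: 4 nodes joined in a loop by 4 edges.
# The real dataset reports 81 nodes and 125 edges.
toy_nodes = ["N1", "N2", "N3", "N4"]
toy_edges = [("N1", "N2"), ("N2", "N3"), ("N3", "N4"), ("N4", "N1")]

adjacency = build_adjacency(toy_edges)
components = connected_components(toy_nodes, adjacency)
```

In this toy case `components` holds one set containing all four nodes, which is the property the abstract describes as "1 subgraph". A graph neural network baseline would typically consume the same adjacency information in matrix or edge-list form.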

