CV · Mar 10

VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

arXiv:2603.09826v141.72 citations · h-index: 8 · Has Code
Predicted impact: top 17% in CV · last 90 days · Originality: Highly original
AI Analysis

The paper addresses localization in complex 3D environments for applications such as robotics and autonomous navigation. It represents an incremental improvement that offers a novel method for a known bottleneck.

The paper tackles text-to-point-cloud localization by proposing VLM-Loc, which uses vision-language models for spatial reasoning, achieving superior accuracy and robustness on the new CityLoc benchmark compared to state-of-the-art methods.

Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird's-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at https://github.com/MCG-NKU/nku-3d-vision.
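The abstract mentions transforming point clouds into bird's-eye-view (BEV) images. As a rough illustration of that general idea only, the sketch below rasterizes 3D points onto a top-down grid and keeps the maximum height per cell. It is not the authors' implementation: the function name `rasterize_bev`, its parameters, and the max-height encoding are all assumptions made for this example, and the actual VLM-Loc rendering may encode semantics or colour instead.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in map coordinates


def rasterize_bev(
    points: Sequence[Point],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    cell_size: float,
) -> List[List[Optional[float]]]:
    """Hypothetical helper: project points onto a top-down max-height grid.

    Cells that receive no point stay None. Rows index y, columns index x.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    cols = max(1, round((x_max - x_min) / cell_size))
    rows = max(1, round((y_max - y_min) / cell_size))
    grid: List[List[Optional[float]]] = [[None] * cols for _ in range(rows)]

    for x, y, z in points:
        # Skip points that fall outside the requested map window.
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        col = min(int((x - x_min) / cell_size), cols - 1)
        row = min(int((y - y_min) / cell_size), rows - 1)
        current = grid[row][col]
        if current is None or z > current:
            grid[row][col] = z  # keep the tallest point seen in this cell
    return grid


# Toy input: two points share a cell, one is outside the window.
toy_points = [(0.5, 0.5, 1.0), (0.6, 0.4, 3.0), (3.2, 1.9, 2.0), (9.0, 9.0, 5.0)]
bev = rasterize_bev(toy_points, x_range=(0.0, 4.0), y_range=(0.0, 2.0), cell_size=1.0)
```

A max-height grid is only one of several possible BEV encodings; it is used here because it needs no semantic labels and makes the projection step easy to inspect.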

