CV | AI | Jul 6, 2025

MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization

arXiv:2507.04509v1 | 1 citation | h-index: 2 | Appl Sci
AI Analysis

MVL-Loc addresses the limited generalization and robustness of existing camera relocalization methods, targeting applications such as AR, autonomous driving, and robotics.

The paper tackles camera relocalization across diverse environments by proposing MVL-Loc, a framework that leverages vision-language models and multimodal data to generalize across indoor and outdoor scenes. It reports state-of-the-art performance on benchmark datasets, with improved positional and orientational accuracy.

Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera's position and orientation (6-DoF) from images and is essential for applications in augmented reality (AR), mixed reality (MR), autonomous driving, delivery drones, and robotic navigation. Traditional deep learning-based methods regress camera pose from images in a single scene and often lack generalization and robustness in diverse environments. In contrast, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models (VLMs) and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc's robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.
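The abstract refers to positional and orientational accuracy without stating how they are measured. As a minimal sketch, assuming the convention commonly used for 6-DoF pose regression (Euclidean distance between camera positions, and the rotation angle between unit quaternions in degrees), the two errors could be computed as below. The function names and the `(w, x, y, z)` quaternion layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def position_error(t_pred, t_true):
    """Euclidean distance between predicted and true camera positions."""
    t_pred = np.asarray(t_pred, dtype=float)
    t_true = np.asarray(t_true, dtype=float)
    return float(np.linalg.norm(t_pred - t_true))


def orientation_error_deg(q_pred, q_true):
    """Angle in degrees between two orientations given as quaternions.

    Assumes (w, x, y, z) ordering; inputs are normalized here, so they
    need not be unit length. The absolute value of the dot product
    handles the q / -q ambiguity (both encode the same rotation).
    """
    q_pred = np.asarray(q_pred, dtype=float)
    q_true = np.asarray(q_true, dtype=float)
    q_pred = q_pred / np.linalg.norm(q_pred)
    q_true = q_true / np.linalg.norm(q_true)
    d = np.clip(abs(np.dot(q_pred, q_true)), 0.0, 1.0)
    return float(np.degrees(2.0 * np.arccos(d)))
```

Benchmark results of this kind are often summarized as the median of these two errors over all test frames of a scene; whether MVL-Loc reports medians is not stated in the text above.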
