CV RO Nov 13, 2025

MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, Chenglu Wen

arXiv:2511.10376v210.25 citations h-index: 3

Originality Highly original

AI Analysis

This work addresses the challenge of open-vocabulary generalization for robotic agents, offering a novel approach to reduce training overhead and enhance navigation efficiency.

The paper tackles zero-shot embodied navigation by introducing a Multi-modal 3D Scene Graph (M3DSG) that preserves visual cues, resulting in a 15% increase in success rate and a 20% reduction in path length compared to text-only methods.

Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relation
