CV, Nov 8, 2025

Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning

Fei Yu, Quan Deng, Shengeng Tang, Yuehua Li, Lechao Cheng

arXiv:2511.05894v1, 10.23 citations, h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses open-world 3D scene understanding for vision and robotics with a scalable approach that combines open-vocabulary perception and retrieval-based reasoning. It appears incremental, since it builds on existing components such as VLMs and retrieval systems.

The paper proposes a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which integrates Vision-Language Models with retrieval-based reasoning to enable generalizable and interactive 3D scene understanding. The authors report robust generalization and superior performance on the 3DSSG and Replica benchmarks across tasks such as scene question answering and visual grounding.

Understanding 3D scenes in open-world settings poses fundamental challenges for vision and robotics, particularly due to the limitations of closed-vocabulary supervision and static annotations. To address this, we propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which enables generalizable and interactive 3D scene understanding. Our method integrates Vision-Language Models (VLMs) with retrieval-based reasoning to support multimodal exploration and language-guided interaction. The framework comprises two key components: (1) a dynamic scene graph generation module that detects objects and infers semantic relationships without fixed label sets, and (2) a retrieval-augmented reasoning pipeline that encodes scene graphs into a vector database to support text/image-conditioned queries. We evaluate our method on 3DSSG and Replica benchmarks across four tasks (scene question answering, visual grounding, instance retrieval, and task planning), demonstrating robust generalization and superior performance in diverse environments. Our results highlight the effectiveness of combining open-vocabulary perception with retrieval-based reasoning for scalable 3D scene understanding.
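
The second component described in the abstract, encoding a scene graph into a vector store and answering text queries by similarity search, can be illustrated with a toy sketch. This is not the authors' implementation: the triples are made-up data, and the hypothetical embed function is a bag-of-words stand-in for whatever VLM or text encoder the paper actually uses. A real system would likely use learned embeddings and a dedicated vector database.

```python
import math
from collections import Counter

# Toy scene graph as (subject, relation, object) triples. Illustrative data only.
SCENE_GRAPH = [
    ("mug", "on top of", "desk"),
    ("chair", "next to", "desk"),
    ("lamp", "standing on", "floor"),
    ("book", "inside", "shelf"),
]


def embed(text):
    """Hypothetical stand-in for a learned encoder: sparse bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(triples):
    """Pair each triple with the embedding of its flattened text form."""
    return [(triple, embed(" ".join(triple))) for triple in triples]


def retrieve(index, query, k=2):
    """Return the k triples whose embeddings are most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [triple for triple, _ in ranked[:k]]


index = build_index(SCENE_GRAPH)
top = retrieve(index, "what is on the desk", k=2)
print(top[0])
```

The retrieved triples would then be passed to a language model as context for question answering or planning. That step is omitted here because the abstract does not specify how it is done.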
