CV · Jun 16, 2025

FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding

arXiv:2506.13629v2 · 4 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses the problem of flexible, language-based interaction with 3D environments for applications such as robotics and AR/VR, though it builds incrementally on existing LLM and scene graph methods.

The paper tackles the challenge of free-form semantic querying in 3D scenes by proposing FreeQ-Graph, which constructs a semantic-consistent scene graph without predefined vocabularies and aligns it with 3D semantic labels, achieving state-of-the-art performance on 6 datasets for tasks like semantic grounding and complex querying.

Semantic querying in complex 3D scenes through free-form language presents a significant challenge. Existing 3D scene understanding methods use large-scale training data and CLIP to align text queries with 3D semantic features. However, their reliance on predefined vocabulary priors from training data hinders free-form semantic querying. In addition, recent methods rely on LLMs for scene understanding but lack comprehensive 3D scene-level information and often overlook potential inconsistencies in LLM-generated outputs. In this paper, we propose FreeQ-Graph, which enables Free-form Querying with a semantic-consistent scene Graph for 3D scene understanding. The core idea is to encode free-form queries from a complete and accurate 3D scene graph without predefined vocabularies, and to align them with 3D-consistent semantic labels, which is accomplished through three key steps. We begin by constructing a complete and accurate 3D scene graph that maps free-form objects and their relations through LLM and LVLM guidance, entirely free from training data or predefined priors. Most importantly, we align graph nodes with accurate semantic labels by leveraging 3D semantically aligned features from merged superpoints, enhancing 3D semantic consistency. To enable free-form semantic querying, we then design an LLM-based reasoning algorithm that combines scene-level and object-level information to perform intricate reasoning. We conducted extensive experiments on 3D semantic grounding, segmentation, and complex querying tasks, while also validating the accuracy of graph generation. Experiments on 6 datasets show that our model excels in both complex free-form semantic queries and intricate relational reasoning.
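The three steps described in the abstract (build a scene graph, align its nodes with semantic labels, then answer a relational query) can be illustrated with a toy sketch. This is not the authors' implementation: the `Node` and `SceneGraph` classes, the three-dimensional feature vectors, the label set, and the `align_labels` and `query` helpers are all hypothetical names invented for illustration. In particular, a plain cosine-similarity lookup stands in for the superpoint-based alignment, and a simple edge filter stands in for the LLM-based reasoning step.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    node_id: int
    caption: str            # free-form description (assumed to come from an LVLM)
    feature: list[float]    # hypothetical per-object 3D feature vector
    label: str = ""         # filled in by align_labels


@dataclass
class SceneGraph:
    nodes: dict[int, Node] = field(default_factory=dict)
    # Each edge is (subject node id, relation text, object node id).
    edges: list[tuple[int, str, int]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, subject: int, relation: str, obj: int) -> None:
        self.edges.append((subject, relation, obj))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align_labels(graph: SceneGraph, label_features: dict[str, list[float]]) -> None:
    """Assign each node the label whose feature is most similar to the node's."""
    for node in graph.nodes.values():
        node.label = max(
            label_features,
            key=lambda name: cosine(node.feature, label_features[name]),
        )


def query(graph: SceneGraph, target: str, relation: str, anchor: str) -> list[int]:
    """Return ids of `target` nodes that stand in `relation` to an `anchor` node."""
    return [
        s
        for s, rel, o in graph.edges
        if rel == relation
        and graph.nodes[s].label == target
        and graph.nodes[o].label == anchor
    ]


# Toy scene: three objects with made-up features and two relations.
graph = SceneGraph()
graph.add_node(Node(0, "a soft grey couch", [0.9, 0.1, 0.0]))
graph.add_node(Node(1, "a low wooden table", [0.1, 0.9, 0.1]))
graph.add_node(Node(2, "a reading light", [0.0, 0.2, 0.9]))
graph.add_edge(2, "next to", 0)
graph.add_edge(1, "in front of", 0)

# Made-up label embeddings in the same toy feature space.
align_labels(graph, {"sofa": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0], "lamp": [0.0, 0.0, 1.0]})

print(query(graph, "lamp", "next to", "sofa"))
```

The point of the sketch is the separation of concerns: free-form captions live on the nodes, label assignment is a separate alignment pass, and querying operates on the aligned labels plus the relation edges. A real system would replace each of these placeholder pieces with learned features and model-based reasoning.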

