RO · CV · May 15

Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces

arXiv:2605.15753
AI Analysis

For robotics and scene understanding, this work addresses the lack of hierarchical structure and small-object coverage in existing benchmarks, enabling more detailed functional reasoning.

This work extends functional 3D scene graphs to include dense tabletop objects and hierarchical relationships, proposing an open-vocabulary pipeline using 2D visual grounding and 3D graph optimization that reliably infers graphs in challenging real-world scenes.

Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly simple design of previous pipelines, which focus primarily on large furniture and lack hierarchical structure. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges arising from small, densely packed, and visually similar instances: a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges in 2D visual evidence and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, further unlocking their potential for practical applications.
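The abstract does not give the actual objective, so the following is only an illustrative sketch of how evidence accumulation, entropy regularization, and temporal smoothing might combine when assigning one interactive element to a parent object. Everything here is an assumption rather than the paper's formulation: the function names `softmax` and `associate_edge`, the parameters `temperature` and `smoothing`, the use of a softmax as the closed-form maximizer of an entropy-regularized linear score, and the exponential moving average used for smoothing.

```python
import math


def softmax(scores, temperature):
    """Maximizer of <p, scores> + temperature * H(p) over probability vectors p."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def associate_edge(frame_evidence, temperature=1.0, smoothing=0.5):
    """Pick the parent object of one interactive element (hypothetical sketch).

    frame_evidence: one list per frame; entry k is the 2D-grounded support
    for candidate parent k in that frame (made-up scores, not paper data).
    """
    n = len(frame_evidence[0])
    accumulated = [0.0] * n      # evidence accumulation across frames
    belief = [1.0 / n] * n       # start from a uniform belief
    for evidence in frame_evidence:
        accumulated = [a + e for a, e in zip(accumulated, evidence)]
        # Entropy-regularized assignment given all evidence seen so far.
        target = softmax(accumulated, temperature)
        # Temporal smoothing: move only part of the way toward the target,
        # so one noisy viewpoint cannot flip the association outright.
        belief = [(1 - smoothing) * b + smoothing * t
                  for b, t in zip(belief, target)]
    best = max(range(n), key=belief.__getitem__)
    return best, belief


# Four frames, three candidate parents; frame 2 is a misleading viewpoint.
frames = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.8, 0.1, 0.1],
    [0.9, 0.0, 0.1],
]
best, belief = associate_edge(frames)
print(best, [round(b, 3) for b in belief])
```

In this toy setup, a lower `temperature` yields more confident (lower-entropy) assignments, while a lower `smoothing` makes the belief change more slowly between frames.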

