Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph
This work addresses the challenge of spatial reasoning and understanding in indoor environments for applications such as robotics and augmented reality. It is an incremental advance that integrates existing foundation models with a novel hierarchical structure.
The paper tackles the problem of open-vocabulary object grounding in indoor environments by proposing OVIGo-3DHSG, a method that uses a 3D hierarchical scene graph and large language models for multistep reasoning, demonstrating efficient scene comprehension and robust object grounding compared to existing methods.
We propose OVIGo-3DHSG, a method for Open-Vocabulary Indoor Grounding of objects using a 3D Hierarchical Scene Graph. OVIGo-3DHSG represents an extensive indoor environment as a hierarchical scene graph derived from sequences of RGB-D frames, using a set of open-vocabulary foundation models and sensor data processing. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. To address complex queries that involve spatial references to other objects, we integrate the hierarchical scene graph with a Large Language Model for multistep reasoning. This integration leverages inter-layer (e.g., room-to-object) and intra-layer (e.g., object-to-object) connections, enhancing spatial contextual understanding. We investigate the semantic and geometric accuracy of the hierarchical representation on Habitat Matterport 3D Semantic multi-floor scenes. Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods. Overall, OVIGo-3DHSG demonstrates strong potential for applications requiring spatial reasoning and understanding of indoor environments. Related materials can be found at https://github.com/linukc/OVIGo-3DHSG.
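To make the layered structure concrete, the following Python sketch shows one way a four-layer graph (floor, room, location, object) with inter-layer and intra-layer edges could be stored and queried step by step. It is an illustration under stated assumptions, not the OVIGo-3DHSG implementation: the names `Node`, `SceneGraph`, `ground`, and `build_demo_graph`, the labels, and the relation strings are all hypothetical, and the Large Language Model reasoning is replaced here by exact label matching.

```python
from dataclasses import dataclass, field

# Layer names follow the hierarchy described in the abstract.
LAYERS = ("floor", "room", "location", "object")


@dataclass
class Node:
    node_id: str
    layer: str
    label: str
    children: list = field(default_factory=list)   # inter-layer edges (parent -> child)
    neighbors: dict = field(default_factory=dict)  # intra-layer edges: neighbor id -> relation


class SceneGraph:
    """Hypothetical container for a hierarchical scene graph."""

    def __init__(self):
        self.nodes = {}

    def add(self, node_id, layer, label, parent=None):
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self.nodes[node_id] = Node(node_id, layer, label)
        if parent is not None:
            self.nodes[parent].children.append(node_id)

    def relate(self, src, dst, relation):
        # Directed intra-layer edge, e.g. object-to-object "next to".
        self.nodes[src].neighbors[dst] = relation

    def descendants(self, node_id, layer):
        # Collect all nodes of the requested layer below node_id.
        found, stack = [], list(self.nodes[node_id].children)
        while stack:
            node = self.nodes[stack.pop()]
            if node.layer == layer:
                found.append(node.node_id)
            else:
                stack.extend(node.children)
        return found


def ground(graph, room_label, target_label, relation, anchor_label):
    """Stepwise grounding: select the room, then candidate objects inside it,
    then keep candidates that have the given relation to an anchor object.
    Exact string matching stands in for the LLM (an assumption of this sketch)."""
    matches = []
    for room in graph.nodes.values():
        if room.layer != "room" or room.label != room_label:
            continue
        for obj_id in graph.descendants(room.node_id, "object"):
            obj = graph.nodes[obj_id]
            if obj.label != target_label:
                continue
            for other_id, rel in obj.neighbors.items():
                if rel == relation and graph.nodes[other_id].label == anchor_label:
                    matches.append(obj_id)
    return matches


def build_demo_graph():
    # Toy scene with made-up labels: one floor, two rooms, two mugs.
    g = SceneGraph()
    g.add("f0", "floor", "ground floor")
    g.add("r0", "room", "kitchen", parent="f0")
    g.add("r1", "room", "bedroom", parent="f0")
    g.add("l0", "location", "counter area", parent="r0")
    g.add("l1", "location", "bed area", parent="r1")
    g.add("o0", "object", "mug", parent="l0")
    g.add("o1", "object", "sink", parent="l0")
    g.add("o2", "object", "mug", parent="l1")
    g.add("o3", "object", "lamp", parent="l1")
    g.relate("o0", "o1", "next to")
    g.relate("o2", "o3", "next to")
    return g
```

In this toy scene, a query such as "the mug next to the sink in the kitchen" is resolved by first narrowing to the kitchen through inter-layer edges and then filtering mugs by their intra-layer relation to a sink, which separates the kitchen mug from the bedroom mug.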