RO · CL · CV · LG · Dec 19, 2024

GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

arXiv:2412.14480v2 · 29 citations · h-index: 22
Originality: Incremental advance
AI Analysis

This paper addresses efficient semantic understanding and planning for robots in unseen environments. It is an incremental improvement over existing methods.

The paper tackles Embodied Question Answering (EQA) by proposing GraphEQA, which uses real-time 3D semantic scene graphs and multi-modal memory to improve exploration and planning. It reports higher success rates and fewer planning steps on the HM-EQA and OpenEQA benchmark datasets.

In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment to answer a situated question with confidence. This problem remains challenging in robotics due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient planning and exploration. To address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantics-guided exploration. We evaluate GraphEQA in simulation on two benchmark datasets, HM-EQA and OpenEQA, and demonstrate that it outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps. We further demonstrate GraphEQA in multiple real-world home and office environments.
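
The hierarchical planning idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Node` class, the three layer names, the `plan_hierarchically` function, and the keyword-based scorer are all hypothetical stand-ins chosen for illustration. In GraphEQA the choice is made by a VLM grounded on the scene graph and images; here a simple keyword score takes its place so the example runs on its own.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple


@dataclass
class Node:
    """One node of a toy hierarchical scene graph (hypothetical structure)."""
    node_id: str
    layer: str   # assumed layers: "building", "room", "object"
    label: str
    children: list[Node] = field(default_factory=list)


def build_toy_graph() -> Node:
    """Hand-built graph: one building, two rooms, a few objects."""
    kitchen = Node("room_0", "room", "kitchen", [
        Node("obj_0", "object", "fridge"),
        Node("obj_1", "object", "sink"),
    ])
    bedroom = Node("room_1", "room", "bedroom", [
        Node("obj_2", "object", "bed"),
    ])
    return Node("bldg_0", "building", "home", [kitchen, bedroom])


def keyword_score(question: str) -> Callable[[str], float]:
    """Stand-in for a VLM: score a label by keyword overlap with the question."""
    words = set(question.lower().replace("?", "").split())
    # Assumed prior knowledge linking rooms to things found in them.
    related = {"kitchen": {"fridge", "sink"}, "bedroom": {"bed"}}

    def score(label: str) -> float:
        direct = 1.0 if label in words else 0.0
        indirect = 0.5 * len(words & related.get(label, set()))
        return direct + indirect

    return score


def plan_hierarchically(
    root: Node, score: Callable[[str], float]
) -> Tuple[Node, Optional[Node]]:
    """Pick the best-scoring room first, then the best object inside it."""
    room = max(root.children, key=lambda n: score(n.label))
    if not room.children:
        return room, None
    obj = max(room.children, key=lambda n: score(n.label))
    return room, obj


room, obj = plan_hierarchically(
    build_toy_graph(), keyword_score("Is the fridge door open?")
)
print(room.label, obj.label if obj else None)
```

Selecting a room before an object keeps each decision small: the scorer only compares siblings at one layer instead of every node in the graph, which is one plausible reading of why a hierarchy helps structured planning.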

