The Bare Necessities: Designing Simple, Effective Open-Vocabulary Scene Graphs
This work addresses computational bottlenecks in scene graph representations for embodied agents, offering an incremental improvement in efficiency.
The paper tackles the computational inefficiency of 3D open-vocabulary scene graph methods for embodied agents by reexamining established design choices. It finds that commonly used image pre-processing techniques triple computation while yielding minimal gains. It proposes a computationally balanced approach that matches state-of-the-art classification accuracy while achieving a threefold reduction in computation.
3D open-vocabulary scene graph methods are a promising map representation for embodied agents; however, many current approaches are computationally expensive. In this paper, we reexamine the critical design choices established in previous works to optimize both efficiency and performance. We propose a general scene graph framework and conduct three studies that focus on image pre-processing, feature fusion, and feature selection. Our findings reveal that commonly used image pre-processing techniques provide minimal performance improvement while tripling computation (on a per-object-view basis). We also show that averaging feature labels across different views significantly degrades performance. We study alternative feature selection strategies that enhance performance without adding unnecessary computational costs. Based on our findings, we introduce a computationally balanced approach for 3D point cloud segmentation with per-object features. The approach matches state-of-the-art classification accuracy while achieving a threefold reduction in computation.
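The contrast between averaging features across views and selecting a single view can be illustrated with a toy sketch. This is not the paper's implementation. The function names, the cosine-similarity scoring, and the "most confident view" selection rule are all assumptions chosen for illustration, and the feature vectors are made-up numbers rather than real image or text embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def classify_by_average(view_feats, text_feats):
    """Fusion by averaging: mean of the per-view features, then nearest label."""
    fused = l2_normalize(l2_normalize(view_feats).mean(axis=0))
    return int(np.argmax(l2_normalize(text_feats) @ fused))

def classify_by_selection(view_feats, text_feats):
    """Hypothetical selection rule: keep only the view whose top label
    similarity is highest, then use that view's label."""
    sims = l2_normalize(view_feats) @ l2_normalize(text_feats).T  # (views, labels)
    best_view = int(np.argmax(sims.max(axis=1)))
    return int(np.argmax(sims[best_view]))

# Toy data (assumed): three labels with orthogonal "text" features, and
# three views of one object. View 0 is a clean view of label 0; views 1
# and 2 are ambiguous views that lean toward label 1.
text_feats = np.eye(3)
view_feats = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.6, 0.5],
    [0.0, 0.6, 0.5],
])

avg_label = classify_by_average(view_feats, text_feats)
sel_label = classify_by_selection(view_feats, text_feats)
print("average:", avg_label, "selection:", sel_label)
```

In this toy case the two ambiguous views outvote the clean one under averaging, so the two strategies return different labels. It only shows the mechanism by which averaging can dilute an informative view; it says nothing about which strategies the paper evaluates or how large the effect is in practice.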