CV RO · May 11

OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

Gang Chen, Sebastián Barbas Laina, Stefan Leutenegger, Javier Alonso-Mora

arXiv:2605.10484 · 29.5

AI Analysis

For robotics applications that require long-term memory and multi-agent map fusion, this work provides a more robust and scalable alignment method that leverages open-set vision-language features, addressing the limitations of prior approaches that rely heavily on geometric point-cloud features.

OpenSGA introduces a unified framework for 3D scene graph alignment that fuses vision-language, textual, and geometric features. It reports the best overall performance on both frame-to-scan and subscan-to-subscan tasks, and introduces ScanNet-SG, a large-scale dataset with over 700k samples covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging.

Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: https://autonomousrobots.nl/paper_websites/opensga.
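The abstract names a minimum-cost-flow-based allocator but does not specify its formulation. As a rough illustration of the general idea only, the sketch below assigns objects from one graph to objects in another by solving a small min-cost-flow problem over a similarity matrix. Everything here is an assumption made for illustration and is not taken from the paper: the function name `min_cost_flow_match`, the cost `1 - similarity`, the `unmatched_cost` penalty that lets an object stay unmatched, and the integer cost scaling.

```python
from collections import deque

def min_cost_flow_match(sim, unmatched_cost=0.5, scale=1000):
    """Hypothetical one-to-one matcher: sim[i][j] is the similarity in [0, 1]
    between object i of graph A and object j of graph B.
    Returns {i: j} for the matched pairs; unmatched objects are omitted."""
    n_a = len(sim)
    n_b = len(sim[0]) if sim else 0
    S, T = n_a + n_b, n_a + n_b + 1            # source and sink node ids
    n_nodes = n_a + n_b + 2
    graph = [[] for _ in range(n_nodes)]       # node -> list of edge indices
    to, cap, cost = [], [], []

    def add_edge(u, v, c, w):
        # Forward edge gets an even index, its residual twin the next odd index.
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)

    for i in range(n_a):
        add_edge(S, i, 1, 0)
        # "Skip" edge: object i may remain unmatched at a fixed penalty.
        add_edge(i, T, 1, round(unmatched_cost * scale))
        for j in range(n_b):
            add_edge(i, n_a + j, 1, round((1.0 - sim[i][j]) * scale))
    for j in range(n_b):
        add_edge(n_a + j, T, 1, 0)             # each B object used at most once

    INF = float("inf")
    while True:
        # Shortest augmenting path by cost (queue-based Bellman-Ford).
        dist = [INF] * n_nodes
        prev_edge = [-1] * n_nodes
        in_queue = [False] * n_nodes
        dist[S] = 0
        queue = deque([S])
        while queue:
            u = queue.popleft()
            in_queue[u] = False
            for e in graph[u]:
                v = to[e]
                if cap[e] > 0 and dist[u] + cost[e] < dist[v]:
                    dist[v] = dist[u] + cost[e]
                    prev_edge[v] = e
                    if not in_queue[v]:
                        queue.append(v)
                        in_queue[v] = True
        if dist[T] == INF:
            break                               # no more flow can be sent
        v = T
        while v != S:                           # push one unit along the path
            e = prev_edge[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]

    matches = {}
    for i in range(n_a):
        for e in graph[i]:
            # A saturated forward edge A_i -> B_j means the pair was selected.
            if e % 2 == 0 and n_a <= to[e] < n_a + n_b and cap[e] == 0:
                matches[i] = to[e] - n_a
    return matches

if __name__ == "__main__":
    print(min_cost_flow_match([[0.9, 0.8], [0.85, 0.1]]))
```

In this toy formulation, a row-wise argmax would send both objects of the first graph to the same target, whereas the flow solution enforces one-to-one assignments and lets low-similarity objects stay unmatched. The paper's allocator may differ substantially in its costs, constraints, and solver.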
