CV · RO · Sep 23, 2025

SGAligner++: Cross-Modal Language-Aided 3D Scene Graph Alignment

Binod Singh, Sayan Deb Sarkar, Iro Armeni

Stanford

arXiv:2509.20401v2 · 6.21 citations · h-index: 8

Originality: Highly original

AI Analysis

This paper addresses the challenge of aligning partially overlapping 3D scene observations for applications such as visual localization and navigation. It represents a strong, specific gain rather than a foundational advance.

The paper tackles the problem of aligning 3D scene graphs, which is crucial for robot navigation and embodied perception, by introducing SGAligner++, a cross-modal, language-aided framework that outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions.

Aligning 3D scene graphs is a crucial initial step for several applications in robot navigation and embodied perception. Current methods in 3D scene graph alignment often rely on single-modality point cloud data and struggle with incomplete or noisy input. We introduce SGAligner++, a cross-modal, language-aided framework for 3D scene graph alignment. Our method addresses the challenge of aligning partially overlapping scene observations across heterogeneous modalities by learning a unified joint embedding space, enabling accurate alignment even under low-overlap conditions and sensor noise. By employing lightweight unimodal encoders and attention-based fusion, SGAligner++ enhances scene understanding for tasks such as visual localization, 3D reconstruction, and navigation, while ensuring scalability and minimal computational overhead. Extensive evaluations on real-world datasets demonstrate that SGAligner++ outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions, while enabling cross-modal generalization.
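
To make the idea of a joint embedding space concrete, here is a minimal sketch of attention-style fusion followed by nearest-neighbor node matching. It is not the authors' implementation: in SGAligner++ the unimodal encoders and the attention weights are learned, whereas this toy uses hand-written embeddings and fixed scores. All names (fuse, match_nodes, the node labels, the embedding values) are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(modal_embeddings, modal_scores):
    """Softmax-weighted sum of per-modality embeddings, then L2 normalization.

    Assumption: modal_scores stand in for learned attention logits.
    """
    weights = softmax(modal_scores)
    dim = len(modal_embeddings[0])
    fused = [
        sum(w * emb[i] for w, emb in zip(weights, modal_embeddings))
        for i in range(dim)
    ]
    return l2_normalize(fused)

def match_nodes(graph_a, graph_b):
    """Map each node in graph_a to its most similar node in graph_b.

    Embeddings are unit length, so the dot product equals cosine similarity.
    """
    def cosine(u, v):
        return sum(x * y for x, y in zip(u, v))
    return {
        name_a: max(graph_b, key=lambda name_b: cosine(emb_a, graph_b[name_b]))
        for name_a, emb_a in graph_a.items()
    }

# Toy data: each node has a (point cloud, text) embedding pair.
scores = [0.0, 0.0]  # equal attention over the two modalities (assumed)
graph_a = {
    "chair": fuse([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]], scores),
    "table": fuse([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]], scores),
}
graph_b = {
    "chair_2": fuse([[0.8, 0.2, 0.0], [1.0, 0.0, 0.0]], scores),
    "table_7": fuse([[0.2, 0.7, 0.1], [0.0, 1.0, 0.0]], scores),
}
matches = match_nodes(graph_a, graph_b)
print(matches)
```

In this toy, the text embedding partially compensates for the noisy point-cloud embedding of each node in graph_b, which is the intuition behind fusing modalities before matching.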
