CV, RO
Sep 23, 2025

SGAligner++: Cross-Modal Language-Aided 3D Scene Graph Alignment

Stanford
arXiv:2509.20401v2
1 citation
h-index: 8
Originality: Highly original
AI Analysis

This work addresses the challenge of aligning partially overlapping 3D scene observations for applications such as visual localization and navigation. It represents a strong, specific gain rather than a foundational advance.

The paper tackles the problem of aligning 3D scene graphs, which is crucial for robot navigation and embodied perception, by introducing SGAligner++, a cross-modal, language-aided framework that outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions.

Aligning 3D scene graphs is a crucial initial step for several applications in robot navigation and embodied perception. Current methods in 3D scene graph alignment often rely on single-modality point cloud data and struggle with incomplete or noisy input. We introduce SGAligner++, a cross-modal, language-aided framework for 3D scene graph alignment. Our method addresses the challenge of aligning partially overlapping scene observations across heterogeneous modalities by learning a unified joint embedding space, enabling accurate alignment even under low-overlap conditions and sensor noise. By employing lightweight unimodal encoders and attention-based fusion, SGAligner++ enhances scene understanding for tasks such as visual localization, 3D reconstruction, and navigation, while ensuring scalability and minimal computational overhead. Extensive evaluations on real-world datasets demonstrate that SGAligner++ outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions, while enabling cross-modal generalization.
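
The abstract's core idea (fuse per-modality node embeddings with attention, then align nodes in a shared embedding space) can be illustrated with a toy sketch. This is not the authors' implementation: the real method uses learned encoders, whereas the vectors below are hand-written, and every name here (`fuse`, `match_nodes`, `query`, `min_sim`) is a hypothetical choice made for illustration. Mutual nearest-neighbour matching is likewise an assumption about how alignment could be read off the joint space.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(modal_embeddings, query):
    """Attention-weighted sum of one node's per-modality embeddings.

    modal_embeddings: dict of modality name -> vector (all the same length).
    query: stand-in for a learned attention query (an assumption of this sketch).
    Only the modalities actually present for the node are fused.
    """
    names = sorted(modal_embeddings)
    weights = softmax([dot(modal_embeddings[n], query) for n in names])
    fused = [
        sum(w * modal_embeddings[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return normalize(fused)


def match_nodes(graph_a, graph_b, query, min_sim=0.5):
    """Return mutual nearest-neighbour node pairs by cosine similarity."""
    emb_a = {k: fuse(v, query) for k, v in graph_a.items()}
    emb_b = {k: fuse(v, query) for k, v in graph_b.items()}
    best_ab = {a: max(emb_b, key=lambda b: dot(emb_a[a], emb_b[b])) for a in emb_a}
    best_ba = {b: max(emb_a, key=lambda a: dot(emb_a[a], emb_b[b])) for b in emb_b}
    return sorted(
        (a, b)
        for a, b in best_ab.items()
        if best_ba[b] == a and dot(emb_a[a], emb_b[b]) >= min_sim
    )


# Toy data: graph A has point-cloud and text embeddings for each node,
# graph B has only a single modality per node (a cross-modal setting).
graph_a = {
    "chair_a": {"point": [1.0, 0.0, 0.0], "text": [0.9, 0.1, 0.0]},
    "table_a": {"point": [0.0, 1.0, 0.0], "text": [0.1, 0.9, 0.0]},
}
graph_b = {
    "table_b": {"text": [0.0, 1.0, 0.1]},
    "chair_b": {"point": [0.8, 0.0, 0.2]},
    "lamp_b": {"point": [0.0, 0.0, 1.0]},
}
query = [1.0, 1.0, 1.0]

matches = match_nodes(graph_a, graph_b, query)
print(matches)
```

In this toy setup the unmatched `lamp_b` node stands in for the non-overlapping part of a partially overlapping scene: it has no mutual nearest neighbour above `min_sim`, so it is left out of the alignment.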
