SGAligner++: Cross-Modal Language-Aided 3D Scene Graph Alignment
SGAligner++ addresses the challenge of aligning partially overlapping 3D scene observations for applications such as visual localization and navigation. It is a strong, specific gain rather than a foundational advance.
The paper tackles the problem of aligning 3D scene graphs, which is crucial for robot navigation and embodied perception, by introducing SGAligner++, a cross-modal, language-aided framework that outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions.
Aligning 3D scene graphs is a crucial initial step for several applications in robot navigation and embodied perception. Current methods in 3D scene graph alignment often rely on single-modality point cloud data and struggle with incomplete or noisy input. We introduce SGAligner++, a cross-modal, language-aided framework for 3D scene graph alignment. Our method addresses the challenge of aligning partially overlapping scene observations across heterogeneous modalities by learning a unified joint embedding space, enabling accurate alignment even under low-overlap conditions and sensor noise. By employing lightweight unimodal encoders and attention-based fusion, SGAligner++ enhances scene understanding for tasks such as visual localization, 3D reconstruction, and navigation, while ensuring scalability and minimal computational overhead. Extensive evaluations on real-world datasets demonstrate that SGAligner++ outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions, while enabling cross-modal generalization.
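To make the idea of attention-based fusion into a joint embedding space concrete, here is a minimal sketch in pure Python. It is not the authors' architecture: the abstract does not specify the encoders, the fusion layer, or the matching rule. The fixed `query` vector (standing in for learned attention parameters), the hand-written per-modality embeddings, the function names, and the mutual nearest-neighbor matching step are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(modal_embs, query):
    """Attention-weighted sum over whichever modalities a node has.

    `query` is a hypothetical stand-in for learned attention parameters.
    """
    names = sorted(modal_embs)
    weights = softmax([dot(query, modal_embs[n]) for n in names])
    fused = [
        sum(w * modal_embs[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return l2_normalize(fused)


def match(nodes_a, nodes_b, query):
    """Pair nodes across two graphs by mutual nearest neighbor (assumed rule)."""
    fa = {k: fuse(v, query) for k, v in nodes_a.items()}
    fb = {k: fuse(v, query) for k, v in nodes_b.items()}
    best_ab = {a: max(fb, key=lambda b: dot(fa[a], fb[b])) for a in fa}
    best_ba = {b: max(fa, key=lambda a: dot(fa[a], fb[b])) for b in fb}
    return {a: b for a, b in best_ab.items() if best_ba[b] == a}


# Toy data: each node maps modality name -> embedding; modalities may be missing.
query = [1.0, 1.0, 1.0]
graph_a = {
    "chair": {"point": [1.0, 0.0, 0.0], "text": [0.9, 0.1, 0.0]},
    "table": {"point": [0.0, 1.0, 0.0]},
}
graph_b = {
    "obj1": {"text": [0.1, 1.0, 0.0]},
    "obj2": {"point": [1.0, 0.1, 0.0], "text": [1.0, 0.0, 0.0]},
}
matches = match(graph_a, graph_b, query)
print(matches)
```

In this toy setup `table` has only a point-cloud embedding and `obj1` has only a text embedding, yet the two can still be compared because both are projected into the same fused space. That is the property the abstract refers to as cross-modal generalization. In a trained system the per-modality vectors would come from learned encoders rather than being written by hand.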