Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models
This work addresses limited 3D scene representation learning, a problem relevant to computer vision researchers and practitioners, by using foundation models to bridge the domain gap. The contribution is incremental in that it builds on existing masked autoencoder and knowledge distillation techniques.
The paper tackles the domain gap that arises when applying foundation models to 3D scene understanding. It proposes Bridge3D, a method that pre-trains 3D models using features, semantic masks, and captions from foundation models. The method is evaluated on 3D object detection and semantic segmentation, and it improves the baseline by 6.3% on the ScanNet dataset.
Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding. However, their potential to enrich 3D scene representation learning is largely untapped due to the existence of the domain gap. In this work, we propose an innovative methodology called Bridge3D to address this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models. Specifically, our method employs semantic masks from foundation models to guide the masking and reconstruction process for the masked autoencoder, enabling more focused attention on foreground representations. Moreover, we bridge the 3D-text gap at the scene level using image captioning foundation models, thereby facilitating scene-level knowledge distillation. We further extend this bridging effort by introducing an innovative object-level knowledge distillation method that harnesses highly accurate object-level masks and semantic text data from foundation models. Our methodology significantly surpasses the performance of existing state-of-the-art methods in 3D object detection and semantic segmentation tasks. For instance, on the ScanNet dataset, Bridge3D improves the baseline by a notable margin of 6.3%. Code will be available at: https://github.com/Zhimin-C/Bridge3D
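The mask-guided masking idea described above can be illustrated with a minimal sketch. The abstract does not specify how Bridge3D selects tokens, so everything here is an assumption made for illustration: the function name `foreground_guided_mask`, the `fg_weight` parameter, and the use of weighted sampling without replacement are hypothetical and are not taken from the authors' code. The sketch only shows one plausible way that foreground flags, derived from semantic masks, could bias which point tokens a masked autoencoder hides.

```python
import random


def foreground_guided_mask(is_foreground, mask_ratio=0.6, fg_weight=4.0, seed=0):
    """Pick which tokens to mask, favouring foreground tokens.

    Hypothetical helper, not the Bridge3D implementation.
    is_foreground: list of bools, one per point token (assumed to come
        from semantic masks projected onto the point cloud).
    mask_ratio: fraction of tokens to mask.
    fg_weight: relative sampling weight of a foreground token
        (background tokens have weight 1.0).
    """
    rng = random.Random(seed)
    n = len(is_foreground)
    num_masked = round(mask_ratio * n)

    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # each token gets key u ** (1 / weight); the largest keys are selected.
    keys = []
    for idx, fg in enumerate(is_foreground):
        weight = fg_weight if fg else 1.0
        u = rng.random()
        keys.append((u ** (1.0 / weight), idx))
    keys.sort(reverse=True)

    masked = [False] * n
    for _, idx in keys[:num_masked]:
        masked[idx] = True
    return masked


if __name__ == "__main__":
    flags = [True] * 5 + [False] * 5
    print(foreground_guided_mask(flags, mask_ratio=0.6))
```

With `fg_weight=1.0` the sketch reduces to uniform random masking, which makes it easy to compare against an unguided baseline; larger values concentrate the masked set on foreground tokens.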