Leveraging 2D-VLM for Label-Free 3D Segmentation in Large-Scale Outdoor Scene Understanding
This work addresses the high cost of 3D annotation for scene understanding, offering a label-free, flexible alternative for applications such as autonomous driving and robotics.
The paper tackles 3D semantic segmentation for large-scale outdoor point clouds without requiring annotated 3D data or paired images, achieving accuracy comparable to supervised methods and enabling open-vocabulary recognition.
This paper presents a novel 3D semantic segmentation method for large-scale point cloud data that requires neither annotated 3D training data nor paired RGB images. The proposed approach projects the 3D point cloud onto 2D images using virtual cameras and performs semantic segmentation with a 2D vision-language foundation model (2D-VLM) guided by natural language prompts. The 3D segmentation is then obtained by aggregating the per-view predictions from multiple viewpoints through weighted voting. Our method outperforms existing training-free approaches and achieves segmentation accuracy comparable to that of supervised methods. Moreover, it supports open-vocabulary recognition, enabling users to detect objects using arbitrary text queries and thus overcoming the fixed label sets of traditional supervised approaches.
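To make the projection and voting steps concrete, the following is a minimal sketch, not the paper's implementation. It assumes an axis-aligned pinhole virtual camera looking along +z, per-view label images that have already been produced by some 2D segmenter, and an inverse-depth vote weight. The abstract does not specify the weighting scheme, so that weight, along with the names `project_point` and `aggregate_labels`, is purely illustrative. Occlusion handling is omitted.

```python
import math


def project_point(point, cam_pos, focal, width, height):
    """Project a 3D point into a virtual pinhole camera.

    Assumption: the camera sits at cam_pos and looks along +z with no
    rotation. Returns (u, v, depth), or None if the point is not visible.
    """
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    if z <= 0:
        return None  # point is behind the camera
    u = math.floor(focal * x / z + width / 2)
    v = math.floor(focal * y / z + height / 2)
    if 0 <= u < width and 0 <= v < height:
        return u, v, z
    return None  # point falls outside the image


def aggregate_labels(points, views, num_classes):
    """Fuse per-view 2D labels into per-point 3D labels by weighted voting.

    views: list of (cam_pos, label_image, focal), where label_image[v][u]
    is the class id predicted for that pixel.
    Assumption: each vote is weighted by inverse depth, so closer views
    count more. Points seen by no view get label -1.
    """
    votes = [[0.0] * num_classes for _ in points]
    for cam_pos, label_image, focal in views:
        height, width = len(label_image), len(label_image[0])
        for i, point in enumerate(points):
            hit = project_point(point, cam_pos, focal, width, height)
            if hit is None:
                continue
            u, v, depth = hit
            votes[i][label_image[v][u]] += 1.0 / depth
    return [
        max(range(num_classes), key=row.__getitem__) if any(row) else -1
        for row in votes
    ]


def constant_image(label, size=4):
    """Hypothetical stand-in for a 2D segmenter's output: one label everywhere."""
    return [[label] * size for _ in range(size)]


points = [(0.0, 0.0, 2.0), (0.0, 0.0, -5.0)]
views = [
    ((0.0, 0.0, 0.0), constant_image(1), 1.0),   # depth 2 for the first point
    ((0.0, 0.0, 1.0), constant_image(0), 1.0),   # depth 1, the closest view
    ((0.0, 0.0, -2.0), constant_image(1), 1.0),  # depth 4, the farthest view
]
labels = aggregate_labels(points, views, num_classes=2)
```

In this toy setup two of the three views vote for class 1 on the first point, yet the single closest view outweighs them under the assumed inverse-depth weighting, which illustrates how weighted voting can differ from a plain majority vote. The second point lies behind every camera and is left unlabeled.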