LOSC: LiDAR Open-voc Segmentation Consolidator
This work addresses open-vocabulary segmentation in autonomous driving by improving label quality, though the contribution is incremental, since it builds on existing back-projection techniques.
The paper tackles the noisy and sparse point labels obtained from image-based VLMs when performing open-vocabulary segmentation of LiDAR scans in driving settings. It consolidates these labels to enforce spatio-temporal consistency and robustness to image-level augmentations, then trains a 3D network on the refined labels. The method, LOSC, outperforms the state of the art in zero-shot open-vocabulary semantic and panoptic segmentation on nuScenes and SemanticKITTI by significant margins.
We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of LiDAR scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet the resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network based on these refined labels. This simple method, called LOSC, outperforms the SOTA of zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, by significant margins.
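To make the back-projection step concrete, the following is a minimal sketch, not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, a 4x4 LiDAR-to-camera transform `T_cam_from_lidar`, and a per-pixel class map `seg_map` produced by some image model; all of these names, and the helper functions themselves, are hypothetical. The `majority_vote` function shows one simple way labels from several views or augmentations could be merged; the abstract does not specify the consolidation procedure, so this stand-in should not be read as the procedure used by LOSC.

```python
import numpy as np

IGNORE = -1  # label for points that receive no pixel (this is why labels are sparse)


def backproject_labels(points_lidar, seg_map, K, T_cam_from_lidar):
    """Give each LiDAR point the class of the image pixel it projects to.

    points_lidar: (N, 3) points in the LiDAR frame.
    seg_map: (H, W) integer class map from an image model (assumed input).
    K: (3, 3) pinhole intrinsics; T_cam_from_lidar: (4, 4) extrinsics.
    """
    n = points_lidar.shape[0]
    h, w = seg_map.shape
    # Move points into the camera frame using homogeneous coordinates.
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    labels = np.full(n, IGNORE, dtype=np.int64)
    in_front = pts_cam[:, 2] > 1e-6  # discard points behind the camera

    # Pinhole projection to pixel coordinates.
    uvw = (K @ pts_cam[in_front].T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(np.int64)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(np.int64)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = seg_map[v[inside], u[inside]]
    return labels


def majority_vote(label_sets, num_classes):
    """Merge several label vectors for the same points by per-point majority.

    label_sets: (V, N) array-like; IGNORE entries cast no vote.
    Illustrative stand-in only, not the paper's consolidation method.
    """
    label_sets = np.asarray(label_sets)
    n = label_sets.shape[1]
    counts = np.zeros((n, num_classes), dtype=np.int64)
    for labels in label_sets:
        valid = labels != IGNORE
        np.add.at(counts, (np.flatnonzero(valid), labels[valid]), 1)
    voted = counts.argmax(axis=1)
    voted[counts.sum(axis=1) == 0] = IGNORE  # no view covered this point
    return voted
```

In this sketch, points behind the camera or outside the image keep the `IGNORE` label, which mirrors the sparsity the abstract mentions. Voting across several predictions for the same point is one way to suppress isolated label noise, under the assumption that errors are not shared across views.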