CV · AI · Jul 7, 2025

All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer

arXiv:2507.05211v2 · 10.23 citations · h-index: 35 · Has Code

Originality: Highly original

AI Analysis

This work targets scene understanding for applications such as robotics and autonomous driving by improving segmentation accuracy, though the contribution is incremental because it builds on existing multimodal models.

The paper tackles unified 3D point cloud segmentation, which is hindered by sparse data and limited annotations. It proposes VDG-Uni3DSeg, a framework that integrates vision-language models and large language models to incorporate multimodal cues, and reports state-of-the-art results in semantic, instance, and panoptic segmentation.

Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by the sparse structure of point clouds, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.
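
The abstract does not spell out the form of the Semantic-Visual Contrastive Loss, so the following is only a generic illustration of the idea of aligning point features with multimodal class queries, not the authors' formulation. It is a minimal sketch assuming one text embedding and one image embedding per class, cosine similarity as the score, and a softmax cross-entropy over classes with a temperature. The function names, the temperature value, and the equal weighting of the two modalities are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(point_feat, class_queries, temperature):
    """Cosine similarity of one point feature to every class query, divided by temperature."""
    p = l2_normalize(point_feat)
    return [
        sum(a * b for a, b in zip(p, l2_normalize(q))) / temperature
        for q in class_queries
    ]


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def query_alignment_loss(point_feats, labels, text_queries, image_queries,
                         temperature=0.07):
    """Hypothetical contrastive loss: pull each point feature toward the text and
    image query of its own class and away from the queries of other classes.

    Assumption: text_queries[c] and image_queries[c] are the embeddings for class c.
    """
    total = 0.0
    for feat, label in zip(point_feats, labels):
        total += cross_entropy(cosine_logits(feat, text_queries, temperature), label)
        total += cross_entropy(cosine_logits(feat, image_queries, temperature), label)
    # Average over points and over the two modalities (equal weights assumed).
    return total / (2 * len(point_feats))
```

In this sketch the loss is small when each point feature lies closest to the text and image queries of its labeled class, and large when it lies closer to another class's queries. In practice the embeddings would come from a pre-trained vision-language model such as CLIP, and the computation would be batched on a tensor library rather than written with Python lists.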
