CV, MM · Sep 21, 2024

CUS3D: CLIP-based Unsupervised 3D Segmentation via Object-level Denoise

arXiv:2409.13982v1
Originality: Incremental advance
AI Analysis

This work addresses the difficulty of acquiring annotation labels for 3D data, a problem faced by computer vision researchers and practitioners. However, it appears incremental: it builds on existing CLIP-based methods and mainly contributes noise reduction.

The paper tackles noise in the 2D-to-3D feature projection used for unsupervised 3D segmentation. It proposes CUS3D, a distillation learning framework with an object-level denoising module, and reports strong unsupervised and open-vocabulary segmentation results.

To ease the difficulty of acquiring annotation labels for 3D data, a common approach is unsupervised and open-vocabulary semantic segmentation that leverages semantic knowledge from 2D CLIP. In this paper, unlike previous research that ignores the "noise" introduced during feature projection from 2D to 3D, we propose a novel distillation learning framework named CUS3D. In our approach, an object-level denoising projection module is designed to screen out this "noise" and ensure more accurate 3D features. Based on the obtained features, a multimodal distillation learning module aligns the 3D features with the CLIP semantic feature space under object-centered constraints to achieve advanced unsupervised semantic segmentation. We conduct comprehensive experiments on both unsupervised and open-vocabulary segmentation, and the results consistently show the superiority of our model in unsupervised segmentation and its effectiveness in open-vocabulary segmentation.
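The abstract does not specify how the object-level denoising projection or the distillation objective are computed. As a rough illustration only, the sketch assumes that points are already grouped into objects (for example by some 3D over-segmentation, not described here) and that each point carries a feature projected from 2D CLIP features. It treats points whose feature disagrees with their object's average (cosine similarity below a hypothetical `threshold`) as noise, gives every point in the object the mean of the remaining features, and scores alignment with a cosine-distance loss. The function names, the thresholding rule, and the loss are hypothetical choices, not the CUS3D implementation.

```python
import math
from collections import defaultdict


def _normalize(v):
    # Scale a vector to unit length (zero vectors are returned unchanged).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _cosine(a, b):
    # Cosine similarity between two vectors.
    return sum(x * y for x, y in zip(_normalize(a), _normalize(b)))


def _mean(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def object_level_denoise(point_feats, object_ids, threshold=0.5):
    """Hypothetical object-level screening of 2D->3D projected features.

    point_feats: per-point feature vectors (assumed projected from 2D CLIP).
    object_ids:  one object/segment id per point (assumed given).
    Points whose feature has cosine similarity < threshold to their object's
    mean are dropped as noise; every point then gets the normalized mean of
    the kept features, i.e. one shared object-level feature.
    """
    groups = defaultdict(list)
    for idx, obj in enumerate(object_ids):
        groups[obj].append(idx)

    denoised = [None] * len(point_feats)
    for idxs in groups.values():
        feats = [point_feats[i] for i in idxs]
        centre = _mean(feats)
        kept = [f for f in feats if _cosine(f, centre) >= threshold]
        if not kept:  # fall back to all points if everything was screened out
            kept = feats
        obj_feat = _normalize(_mean(kept))
        for i in idxs:
            denoised[i] = obj_feat
    return denoised


def distillation_loss(student_feats, target_feats):
    """Mean cosine distance (1 - cos) between 3D features and targets."""
    losses = [1.0 - _cosine(s, t) for s, t in zip(student_feats, target_feats)]
    return sum(losses) / len(losses)
```

For example, with features `[[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]` all in object 0, the third point falls below the threshold, so all three points receive a feature pointing almost entirely along the first axis. Sharing one feature per object is one simple way to express an "object-centered" constraint; the paper may use a different rule.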

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it, not by global fame.
