CV, CL, LG, RO, IV · Oct 25, 2025

LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction

arXiv:2510.22141v1 · 4 citations · h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses the scarcity of 3D data for open-set recognition in autonomous driving and robotics, but it appears incremental because it builds on existing occupancy networks and CLIP features.

The paper tackles the challenge of applying Vision-Language Models to 3D scene understanding by proposing LOC, a language-guided framework that supports both supervised and self-supervised learning. On the nuScenes dataset, LOC achieves high-precision predictions for known classes and distinguishes unknown classes.

Vision-Language Models (VLMs) have made significant progress on open-set tasks. However, the limited availability of 3D datasets hinders their application to 3D scene understanding. We propose LOC, a general language-guided framework that is adaptable to various occupancy networks and supports both supervised and self-supervised learning. For self-supervised tasks, we fuse multi-frame LiDAR points for dynamic and static scenes, use Poisson reconstruction to fill voids, and assign semantics to voxels via K-Nearest Neighbors (KNN) to obtain comprehensive voxel representations. To mitigate the feature over-homogenization caused by directly distilling high-dimensional features, we introduce Densely Contrastive Learning (DCL), which leverages dense voxel semantic information and predefined text prompts. DCL efficiently enhances open-set recognition without dense pixel-level supervision, and the framework can also leverage existing ground truth to further improve performance. Our model predicts dense voxel features embedded in the CLIP feature space, integrating textual and image-pixel information, and classifies each voxel based on text and semantic similarity. Experiments on the nuScenes dataset demonstrate superior performance: the method achieves high-precision predictions for known classes and distinguishes unknown classes without additional training data.
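The three core operations named in the abstract (KNN semantic assignment to voxels, a dense contrastive loss against text prompts, and classification by text similarity) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `knn_voxel_labels`, `dense_contrastive_loss`, and `classify_open_set` are hypothetical. Multi-frame fusion and Poisson void filling are assumed to have already produced the fused points and voxel centers. DCL is written here as a standard InfoNCE-style cross-entropy over one prompt embedding per class, which may differ from the paper's exact formulation. Marking a voxel as unknown when its best cosine similarity falls below a threshold is likewise an assumed rule, and toy vectors stand in for real CLIP embeddings.

```python
import numpy as np

# Hypothetical sketch: these functions illustrate the ideas in the abstract;
# they are not taken from the LOC code base.


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def knn_voxel_labels(points, point_labels, voxel_centers, k=3):
    """Give each voxel the majority label of its k nearest labeled LiDAR points.

    points:        (N, 3) fused multi-frame points (assumed already in one frame)
    point_labels:  (N,) non-negative integer semantic labels
    voxel_centers: (M, 3) centers of occupied voxels (assumed already void-filled)
    """
    # Brute-force squared distances, shape (M, N); a KD-tree would scale better.
    d2 = ((voxel_centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    nn_idx = np.argsort(d2, axis=1)[:, :k]              # (M, k) nearest point indices
    n_classes = point_labels.max() + 1
    votes = np.eye(n_classes)[point_labels[nn_idx]].sum(axis=1)  # (M, C) vote counts
    return votes.argmax(axis=1)                          # ties go to the lower label id


def dense_contrastive_loss(voxel_feats, voxel_labels, text_embeds, tau=0.07):
    """Assumed InfoNCE-style loss: pull each voxel feature toward its class prompt
    embedding and push it away from the other prompts."""
    logits = l2_normalize(voxel_feats) @ l2_normalize(text_embeds).T / tau  # (M, C)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(voxel_labels)), voxel_labels].mean()


def classify_open_set(voxel_feats, text_embeds, unknown_thresh=0.2):
    """Pick the most similar prompt per voxel; below the assumed threshold, -1."""
    sims = l2_normalize(voxel_feats) @ l2_normalize(text_embeds).T  # cosine, (M, C)
    pred = sims.argmax(axis=1)
    pred[sims.max(axis=1) < unknown_thresh] = -1         # -1 marks "unknown"
    return pred


# Toy data: two clusters of labeled points and one voxel at each cluster.
points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                   [5.0, 5.0, 0.0], [5.1, 5.0, 0.0], [5.0, 5.1, 0.0]])
point_labels = np.array([0, 0, 0, 1, 1, 1])
voxel_centers = np.array([[0.05, 0.05, 0.0], [5.05, 5.05, 0.0]])
print(knn_voxel_labels(points, point_labels, voxel_centers))  # [0 1]

# Toy 4-D "text embeddings" for three classes; real ones would come from CLIP.
text_embeds = np.eye(3, 4)
voxel_feats = np.array([[1.0, 0.1, 0.0, 0.0],   # close to class 0
                        [0.0, 0.0, 0.0, 1.0]])  # orthogonal to every prompt
print(classify_open_set(voxel_feats, text_embeds))  # [ 0 -1]

# The loss is lower when the voxel is paired with its matching prompt.
aligned = dense_contrastive_loss(voxel_feats[:1], np.array([0]), text_embeds)
wrong = dense_contrastive_loss(voxel_feats[:1], np.array([2]), text_embeds)
print(aligned < wrong)  # True
```

Normalizing both features and prompts before the dot product turns the logits into cosine similarities, so the temperature `tau` alone controls how sharply the loss separates prompts; the value 0.07 is an assumed CLIP-style default rather than a setting reported for LOC. Because the loss compares each voxel against a small set of class prompts instead of regressing the full high-dimensional CLIP feature, it is one plausible way to avoid the over-homogenization the abstract attributes to direct distillation.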

