CV | Jan 14

OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

arXiv:2601.09575v1 | 3 citations | h-index: 4
Originality: Incremental advance
AI Analysis

OpenVoxel addresses the need for efficient 3D scene understanding without training, which benefits researchers and applications in robotics and AR/VR. The contribution is incremental, since it builds on existing voxel and vision-language models.

The authors tackle open-vocabulary 3D scene understanding with OpenVoxel, a training-free algorithm that groups and captions the sparse voxels of a scene reconstructed from multi-view images. They report superior performance over recent studies on complex referring expression segmentation tasks.

We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel also builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex RES tasks. The code will be open.
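
To make the scene-map idea in the abstract concrete, here is a minimal sketch of querying captioned voxel groups by text rather than by embedding similarity. It is not the authors' implementation: `VoxelGroup`, `token_overlap`, and `select_voxels` are hypothetical names, and the token-overlap scorer is a toy stand-in for the MLLM text-to-text search, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VoxelGroup:
    """One grouped object in the scene map (hypothetical container)."""
    group_id: int
    voxel_ids: List[int]
    caption: str


def token_overlap(query: str, caption: str) -> float:
    """Toy scorer: fraction of query words that appear in the caption.

    Assumption: this replaces the MLLM-based text-to-text matching,
    purely so the sketch runs without any model.
    """
    query_words = set(query.lower().split())
    caption_words = set(caption.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & caption_words) / len(query_words)


def select_voxels(
    scene_map: List[VoxelGroup],
    query: str,
    scorer: Callable[[str, str], float] = token_overlap,
) -> List[int]:
    """Return the voxel ids of the group whose caption best matches the query."""
    if not scene_map:
        return []
    best = max(scene_map, key=lambda group: scorer(query, group.caption))
    return best.voxel_ids


# Made-up example data, for illustration only.
example_map = [
    VoxelGroup(0, [0, 1, 2], "a red ceramic mug on the wooden table"),
    VoxelGroup(1, [3, 4], "a black laptop next to the window"),
]
selected = select_voxels(example_map, "the red mug")
```

Because the scorer is passed in as a plain function, a call to an actual language model could be substituted without changing the lookup logic. The selected voxel ids would then be rendered into a 2D mask for the segmentation task.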

