CV | RO | Jul 1, 2025

GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond

arXiv:2507.00886v1 | 9 citations | h-index: 30 | IEEE Robot Autom Lett
Originality: Highly original
AI Analysis

This work addresses limitations of 3D vision-language models for embodied reasoning and beyond by reducing their dependence on object detectors, offering a more flexible and efficient approach to scene understanding.

The paper tackles the problem of 3D scene understanding in vision-language models by proposing a scene-centric 3D VLM that embeds linguistic features into 3D Gaussian splat scenes, achieving a five-fold performance improvement over prior methods in out-of-domain settings.

As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images. It demonstrates strong generalization, improving on the performance of prior 3D VLMs five-fold in out-of-domain settings.
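The dual-sparsifier idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how either pathway selects tokens, so the sketch swaps in two simple stand-ins, top-k cosine similarity for the task-guided pathway and a radius filter for the location-guided pathway. The names `Gaussian`, `task_guided_tokens`, `location_guided_tokens`, the two-dimensional features, and all parameters are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    """Toy Gaussian primitive carrying a language-aligned feature (hypothetical)."""
    position: tuple  # 3D centre of the primitive
    feature: list    # per-Gaussian language feature, tiny dimension for illustration


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def task_guided_tokens(gaussians, task_embedding, k):
    """Stand-in for the task-guided pathway: keep the k features
    most similar to the task embedding (assumed selection rule)."""
    ranked = sorted(
        gaussians,
        key=lambda g: cosine(g.feature, task_embedding),
        reverse=True,
    )
    return [g.feature for g in ranked[:k]]


def location_guided_tokens(gaussians, query_point, radius):
    """Stand-in for the location-guided pathway: keep features of
    Gaussians whose centres lie within `radius` of the query point."""
    return [
        g.feature
        for g in gaussians
        if math.dist(g.position, query_point) <= radius
    ]


# A three-Gaussian toy scene.
scene = [
    Gaussian((0.0, 0.0, 0.0), [1.0, 0.0]),
    Gaussian((1.0, 0.0, 0.0), [0.0, 1.0]),
    Gaussian((5.0, 5.0, 5.0), [0.9, 0.1]),
]

global_tokens = task_guided_tokens(scene, task_embedding=[1.0, 0.0], k=2)
local_tokens = location_guided_tokens(scene, query_point=(0.0, 0.0, 0.0), radius=1.5)
```

In this toy scene the task-guided stand-in keeps the two features closest in direction to the task embedding, regardless of where they sit in space, while the location-guided stand-in keeps the two Gaussians near the origin, regardless of their features. The point is only to show how a dense per-Gaussian representation can be reduced to a small set of global and local tokens; a learned sparsifier would replace both selection rules.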

