SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
The paper addresses the lack of models that learn semantics end-to-end from 3D data alone, a capability essential for comprehensive real-world 3D scene understanding. The contribution is incremental in that it builds on existing 3DGS methods.
The paper tackles 3D scene understanding by introducing SceneSplat, which the authors describe as, to their knowledge, the first large-scale approach that operates natively on 3D Gaussian Splatting (3DGS) for indoor scenes. It reports significant gains over established baselines on the new SceneSplat-7K dataset, which comprises 7916 scenes.
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training or together at inference. This highlights the clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable manner remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 7916 scenes derived from seven established datasets, such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 150 GPU days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed method over the established baselines.