GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting
This work addresses open-vocabulary 3D scene understanding, with applications in 3D reconstruction and computer vision. It is an incremental advance: a new method aimed at a known bottleneck.
The paper tackles the problem of capturing fine-grained, language-aware 3D representations from 2D images. It introduces GALA, a framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting, and reports strong performance on both 2D and 3D queries in experiments on real-world datasets.
3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend this to generalized language feature fields, we introduce the core contribution of GALA: a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries, and it reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance on both 2D and 3D queries.
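To make the codebook idea concrete, the following is a minimal sketch of how a low-dimensional per-Gaussian instance feature could be lifted to a language feature through cross-attention over two codebooks. The abstract does not specify the module's details, so everything here is an assumption for illustration: that one codebook acts as attention keys and the other as values, the use of scaled dot-product attention, the toy dimensions, and the names `codebook_cross_attention` and `feature_floats`. Random numbers stand in for learned parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def codebook_cross_attention(query, key_codebook, value_codebook):
    """Map a low-dim instance feature to a language-dim feature.

    Assumption: key_codebook rows live in the instance-feature space and
    value_codebook rows live in the language-feature space.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale
              for key in key_codebook]
    weights = softmax(scores)
    out_dim = len(value_codebook[0])
    output = [sum(w * value[j] for w, value in zip(weights, value_codebook))
              for j in range(out_dim)]
    return output, weights


def feature_floats(num_gaussians, low_dim, lang_dim, num_entries):
    """Count stored floats: direct per-Gaussian language features versus
    low-dim features plus two shared codebooks (hypothetical accounting)."""
    direct = num_gaussians * lang_dim
    compact = num_gaussians * low_dim + num_entries * (low_dim + lang_dim)
    return direct, compact


# Toy sizes, chosen only for illustration.
LOW_DIM, LANG_DIM, NUM_ENTRIES = 4, 8, 3
rng = random.Random(0)
key_codebook = [[rng.gauss(0, 1) for _ in range(LOW_DIM)]
                for _ in range(NUM_ENTRIES)]
value_codebook = [[rng.gauss(0, 1) for _ in range(LANG_DIM)]
                  for _ in range(NUM_ENTRIES)]

# Two Gaussians of the same instance that share an instance feature.
feat_a = [0.5, -1.0, 0.25, 2.0]
feat_b = list(feat_a)
lang_a, weights_a = codebook_cross_attention(feat_a, key_codebook, value_codebook)
lang_b, weights_b = codebook_cross_attention(feat_b, key_codebook, value_codebook)

direct, compact = feature_floats(1_000_000, 16, 512, 64)
```

In this sketch, Gaussians with identical instance features receive identical language features, because the output depends only on the query and the shared codebooks. This mirrors the intra-instance similarity the abstract describes. The `feature_floats` helper illustrates the memory argument under the same assumptions: only the low-dimensional feature is stored per Gaussian, while the high-dimensional embeddings live in the shared codebooks.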