CV · May 22

LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images

arXiv:2605.23287
AI Analysis

This work addresses the problem of efficient, pose-free 3D scene reconstruction with language grounding, enabling low-latency multimodal scene understanding for generalizable 3D vision.

LangFlash is a feed-forward framework for 3D language Gaussian splatting that reconstructs 3D scenes with language-aligned semantic features from sparse unposed images in a single forward pass, achieving superior novel view synthesis and semantic consistency compared to previous methods.

We present LangFlash, a feed-forward framework for 3D Language Gaussian Splatting that reconstructs 3D scenes parameterized by Gaussian primitives enriched with language-aligned semantic features from sparse, unposed multi-view images. Unlike optimization-based 3D methods, LangFlash directly predicts geometry and semantics in a single forward pass, enabling low-latency 3D reconstruction and language-consistent scene understanding. To support large-scale training, we enrich the RealEstate10k dataset with coherent, dense semantic annotations for 3D semantic supervision. Furthermore, we propose a sparse semantic encoding scheme that combines a global semantic dictionary with locally varying per-primitive weights, preserving high-level linguistic information while reducing representation complexity. Experimental results show that LangFlash achieves superior novel view synthesis and semantic consistency compared with previous methods. This study establishes a new paradigm for pose-free, language-grounded 3D scene reconstruction, advancing generalizable 3D vision and multimodal scene understanding. A demo is available at https://liylo.github.io/langflash.github.io/.
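
The sparse semantic encoding idea in the abstract (a global semantic dictionary combined with per-primitive weights) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the top-k selection, the weight normalization, the sizes K, D, N, k, and the function names `sparsify` and `decode_semantics` are all assumptions made for illustration. It only shows how storing a few weights per Gaussian plus one shared dictionary can stand in for a dense feature vector per Gaussian.

```python
import numpy as np

def sparsify(dense_weights, k):
    """Keep the k largest weights per primitive and renormalise them to sum to 1.
    (Hypothetical helper; top-k selection is an assumption, not the paper's rule.)"""
    idx = np.argpartition(-dense_weights, k - 1, axis=1)[:, :k]  # (N, k) atom ids
    w = np.take_along_axis(dense_weights, idx, axis=1)           # (N, k) kept weights
    w = w / w.sum(axis=1, keepdims=True)
    return idx, w

def decode_semantics(dictionary, indices, weights):
    """Rebuild dense per-primitive features from the shared dictionary.

    dictionary: (K, D) global semantic atoms shared by all primitives
    indices:    (N, k) which atoms each primitive uses
    weights:    (N, k) per-primitive mixing weights
    returns:    (N, D) language-aligned features
    """
    atoms = dictionary[indices]                      # (N, k, D)
    return np.einsum("nk,nkd->nd", weights, atoms)   # weighted sum over the k atoms

# Illustrative sizes only (assumed, not taken from the paper).
K, D, N, k = 64, 512, 1000, 4
rng = np.random.default_rng(0)
dictionary = rng.normal(size=(K, D))
dense = rng.random(size=(N, K))          # stand-in for predicted atom scores

indices, weights = sparsify(dense, k)
features = decode_semantics(dictionary, indices, weights)

# Rough storage comparison, counting stored numbers.
dense_cost = N * D                       # one D-dim feature per primitive
sparse_cost = K * D + N * k * 2          # dictionary + (index, weight) pairs
```

Under these assumed sizes the sparse form stores far fewer numbers than one dense D-dimensional feature per primitive, which is the kind of reduction in representation complexity the abstract refers to.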

