CV, Dec 19, 2025

FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views

arXiv:2512.17541v1
h-index: 18
Originality: Incremental advance
AI Analysis

The paper addresses 3D scene understanding and semantic embedding for robotics and AR/VR applications. The advance is incremental, since it builds on existing feed-forward reconstruction and Gaussian splatting methods.

The paper tackles the problem of reconstructing language-embedded 3D Gaussians from arbitrary views without requiring 3D annotations, achieving efficient reconstruction with accurate geometry, high-fidelity appearance, and language-aligned semantics.

We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads but suffer from fixed input views and insufficient 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images. Since the framework does not require 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich semantic embedding. We also propose an instance-guided contrastive learning to align 2D semantics with the 3D representations. In addition, to mitigate the high memory and computational cost of dense views, we further propose a geometry-semantic hierarchical sparsification strategy. Our FLEG efficiently reconstructs language-embedded 3D Gaussian representation in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks. Project page: https://fangzhou2000.github.io/projects/fleg.
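The abstract does not spell out the instance-guided contrastive objective, so the following is only a minimal sketch of one common way such an objective can be written: a supervised-contrastive (InfoNCE-style) loss in which features sharing a 2D instance ID are pulled together and features from different instances are pushed apart. The function name `instance_contrastive_loss`, the `temperature` parameter, and the assumption that features are sampled per pixel are all hypothetical and are not taken from the paper.

```python
import numpy as np


def instance_contrastive_loss(features, instance_ids, temperature=0.1):
    """Supervised-contrastive loss over per-sample features (illustrative only).

    features:     (N, D) array, e.g. semantic features rendered at N pixels.
    instance_ids: (N,) integer array of 2D instance labels for those pixels.
    NOTE: a generic formulation assumed for illustration, not FLEG's exact loss.
    """
    features = np.asarray(features, dtype=np.float64)
    instance_ids = np.asarray(instance_ids)
    n = features.shape[0]
    if n < 2:
        return 0.0

    # L2-normalise so the dot product is cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    z = features / np.clip(norms, 1e-12, None)

    # Pairwise similarities scaled by temperature; mask out self-pairs.
    not_self = ~np.eye(n, dtype=bool)
    logits = np.where(not_self, z @ z.T / temperature, -np.inf)

    # Numerically stable log-softmax over all other samples in each row.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives: other samples that carry the same instance ID.
    positives = (instance_ids[:, None] == instance_ids[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one positive
    if not valid.any():
        return 0.0

    # Average log-probability of positives per anchor, then over anchors.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


# Usage: features that agree with the instance grouping give a lower loss
# than the same features paired with a mismatched grouping.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
aligned = instance_contrastive_loss(feats, np.array([0, 0, 1, 1]))
mismatched = instance_contrastive_loss(feats, np.array([0, 1, 0, 1]))
```

In this toy setup the loss needs only 2D instance IDs as supervision, which is in the spirit of the 3D-annotation-free training the abstract describes. How FLEG actually samples features and weights the term is not stated here.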

