CV · Mar 14, 2025

EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting

Di Li, Jie Feng, Jiahao Chen, Weisheng Dong, Guanbin Li, Guangming Shi, Licheng Jiao

arXiv:2503.11345v111.84 citations · h-index: 39

Originality: Incremental advance

AI Analysis

This work addresses the challenge of semantic inconsistencies and artifacts in egocentric scenes for applications like augmented reality and robotics, representing a strong specific gain rather than a foundational advancement.

The paper tackles the problem of open-vocabulary egocentric scene understanding, which involves frequent occlusions and dynamic interactions, by proposing EgoSplat, a language-embedded 3D Gaussian Splatting framework, achieving state-of-the-art performance with an 8.2% improvement in localization accuracy and a 3.7% improvement in segmentation mIoU on the ADT dataset.

Egocentric scenes exhibit frequent occlusions, varied viewpoints, and dynamic interactions compared to typical scene understanding tasks. Occlusions and varied viewpoints can lead to multi-view semantic inconsistencies, while dynamic objects may act as transient distractors, introducing artifacts into semantic feature modeling. To address these challenges, we propose EgoSplat, a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. A multi-view consistent instance feature aggregation method is designed to leverage the segmentation and tracking capabilities of SAM2 to selectively aggregate complementary features across views for each instance, ensuring precise semantic representation of scenes. Additionally, an instance-aware spatial-temporal transient prediction module is constructed to improve spatial integrity and temporal continuity in predictions by incorporating spatial-temporal associations across multi-view instances, effectively reducing artifacts in the semantic reconstruction of egocentric scenes. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets, outperforming existing methods with an 8.2% improvement in localization accuracy and a 3.7% improvement in segmentation mIoU on the ADT dataset, and setting a new benchmark in open-vocabulary egocentric scene understanding. The code will be made publicly available.
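
The abstract does not specify how features are aggregated across views, so the following is only an illustrative sketch of one plausible reading, not the authors' method. It assumes each tracked instance comes with a per-view feature vector and a visibility score in [0, 1] (for example, the fraction of the instance mask left unoccluded). Views below a threshold are dropped and the rest are combined by a visibility-weighted mean. The function name `aggregate_instance_features`, the `min_visibility` parameter, and the input layout are all hypothetical.

```python
import math


def aggregate_instance_features(observations, min_visibility=0.2):
    """Combine per-view features into one unit-norm feature per instance.

    observations: iterable of (instance_id, feature_vector, visibility),
    one entry per view in which the instance was tracked. This layout is
    an assumption made for illustration only.
    """
    sums = {}
    for inst_id, feat, vis in observations:
        if vis < min_visibility:
            continue  # skip views where the instance is heavily occluded
        acc = sums.setdefault(inst_id, [0.0] * len(feat))
        for i, x in enumerate(feat):
            acc[i] += vis * x  # more visible views contribute more

    aggregated = {}
    for inst_id, acc in sums.items():
        norm = math.sqrt(sum(x * x for x in acc))
        if norm > 0:
            # L2-normalise so features can be compared by cosine similarity
            aggregated[inst_id] = [x / norm for x in acc]
    return aggregated


views = [
    ("mug", [1.0, 0.0], 1.0),   # clear view
    ("mug", [0.0, 1.0], 0.1),   # mostly occluded, dropped
    ("mug", [1.0, 0.0], 0.5),   # partial view
    ("cup", [0.0, 1.0], 0.05),  # never clearly visible
]
features = aggregate_instance_features(views)
```

In this toy input, the occluded view of "mug" is excluded, so its aggregated feature is determined by the two clearer views, and "cup" receives no feature at all because no view passes the threshold.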
