CV | AI | RO | Nov 26, 2024

g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks

arXiv:2411.17030v1 | 22 citations | h-index: 3 | CVPR
Originality: Incremental advance
AI Analysis

This work addresses the challenge of enabling AI agents to understand and interact in 3D environments using language, with incremental improvements in representation learning for embodied tasks.

The paper tackles generalizable 3D-language representations for embodied AI by introducing g3D-LF, which processes posed RGB-D images to predict novel-view representations, generate BEV maps, and query targets with language. The authors report advantages on navigation and question-answering tasks.

We introduce Generalizable 3D-Language Feature Fields (g3D-LF), a 3D representation model pre-trained on a large-scale 3D-language dataset for embodied tasks. Our g3D-LF processes posed RGB-D images from agents to encode feature fields for: 1) Novel view representation prediction from any position in the 3D scene; 2) Generation of BEV maps centered on the agent; 3) Querying targets using multi-granularity language within the above-mentioned representations. Our representation can be generalized to unseen environments, enabling real-time construction and dynamic updates. By volume rendering latent features along sampled rays and integrating semantic and spatial relationships through multiscale encoders, our g3D-LF produces representations at different scales and perspectives, aligned with multi-granularity language via multi-level contrastive learning. Furthermore, we prepare a large-scale 3D-language dataset to align the representations of the feature fields with language. Extensive experiments on Vision-and-Language Navigation under both Panorama and Monocular settings, Zero-shot Object Navigation, and Situated Question Answering tasks highlight the significant advantages and effectiveness of our g3D-LF for embodied tasks.
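The abstract mentions volume rendering latent features along sampled rays. As a rough illustration only, the sketch below composites per-sample feature vectors along a single ray using the standard NeRF-style quadrature (opacity from density, weighted by accumulated transmittance). This is an assumption about the general technique, not the paper's actual formulation; the function name `render_ray_feature` and its parameters are hypothetical.

```python
import math


def render_ray_feature(densities, features, deltas):
    """Composite latent features along one ray (NeRF-style quadrature).

    Hypothetical helper for illustration; not taken from the g3D-LF code.

    densities: non-negative floats sigma_i, one per sample on the ray
    features:  feature vectors f_i (lists of floats), one per sample
    deltas:    segment lengths between consecutive samples
    Returns (rendered_feature, weights).
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, feat, delta in zip(densities, features, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(dim):
            rendered[k] += weight * feat[k]
        transmittance *= 1.0 - alpha  # light remaining after this segment
    return rendered, weights


# Toy ray: empty space first, then two dense samples.
feature, weights = render_ray_feature(
    densities=[0.0, 50.0, 50.0],
    features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    deltas=[0.1, 0.1, 0.1],
)
```

In this toy ray the first sample has zero density and contributes nothing, while the first dense sample absorbs most of the ray, so the rendered feature is dominated by that sample's vector. The weights sum to at most 1, which is what lets the rendered vector be read as a visibility-weighted average of the features along the ray.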

Code Implementations: 1 repo
