CV · May 21, 2023

VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations

arXiv:2305.12427v2 · 32 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for language-grounded scene understanding in robotics, offering a promising representation without prior object class knowledge, though it appears incremental as it builds on similar models like CLIP-Fields.

The paper tackled the problem of enabling open-vocabulary semantic queries in neural implicit spatial representations by introducing VL-Fields, which fuses scene geometry with vision-language features and outperformed CLIP-Fields by almost 10% in semantic segmentation.

We present Visual-Language Fields (VL-Fields), a neural implicit spatial representation that enables open-vocabulary semantic queries. Our model encodes and fuses the geometry of a scene with vision-language trained latent features by distilling information from a language-driven segmentation model. VL-Fields is trained without requiring any prior knowledge of the scene object classes, which makes it a promising representation for the field of robotics. Our model outperformed the similar CLIP-Fields model in the task of semantic segmentation by almost 10%.
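To make "open-vocabulary semantic queries" concrete, the sketch below shows one common way such a query can be answered once a field returns a vision-language feature per 3D point: compare each point feature with the embeddings of free-form text prompts and keep the best match. This is an illustration under assumptions, not the authors' implementation. The abstract does not specify the query mechanism, and the names `cosine`, `query_labels`, `point_features` and `text_embeddings` are hypothetical. The three-dimensional vectors are toy stand-ins for features that would really come from the implicit field and from a text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def query_labels(point_features, text_embeddings):
    """Assign each point the text prompt whose embedding is most similar.

    point_features: list of feature vectors, one per queried 3D point
                    (assumed to be the output of the implicit field).
    text_embeddings: dict mapping a prompt string to its embedding
                     (assumed to come from a text encoder).
    """
    labels = []
    for feature in point_features:
        best = max(
            text_embeddings,
            key=lambda prompt: cosine(feature, text_embeddings[prompt]),
        )
        labels.append(best)
    return labels


# Toy data (hypothetical): real embeddings are high-dimensional.
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "floor": [0.0, 0.0, 1.0],
}
point_features = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.7],
]

labels = query_labels(point_features, text_embeddings)
print(labels)
```

Because the prompts are supplied only at query time, the label set can change without retraining, which is consistent with the abstract's point that no prior knowledge of the scene's object classes is required.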

