RO | CV | Oct 11, 2022

CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory

arXiv:2210.05663v3 | 224 citations | h-index: 51
Originality: Incremental advance
AI Analysis

Using CLIP-Fields as a weakly supervised scene memory, robots can perform semantic navigation in real-world environments; this represents an incremental advance in robotic memory systems.

The paper tackles the problem of building semantic scene models for robotics without direct human supervision. It proposes CLIP-Fields, which learns a mapping from spatial locations to semantic embeddings, supervised only by web-trained models such as CLIP, Detic, and Sentence-BERT. On the HM3D dataset, it outperforms baselines like Mask-RCNN on few-shot instance identification and semantic segmentation while using only a fraction of the examples.

We propose CLIP-Fields, an implicit scene model that can be used for a variety of tasks, such as segmentation, instance identification, semantic search over space, and view localization. CLIP-Fields learns a mapping from spatial locations to semantic embedding vectors. Importantly, we show that this mapping can be trained with supervision coming only from web-image and web-text trained models such as CLIP, Detic, and Sentence-BERT; and thus uses no direct human supervision. When compared to baselines like Mask-RCNN, our method outperforms on few-shot instance identification or semantic segmentation on the HM3D dataset with only a fraction of the examples. Finally, we show that using CLIP-Fields as a scene memory, robots can perform semantic navigation in real-world environments. Our code and demonstration videos are available here: https://mahis.life/clip-fields
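To make "semantic search over space" concrete, here is a minimal sketch and not the authors' implementation. It assumes the field has already been evaluated at a few 3D locations, and it replaces the learned implicit mapping with a plain list of (location, embedding) pairs. The 4-dimensional vectors are hand-made stand-ins, not real CLIP or Sentence-BERT outputs, and the names `cosine_similarity`, `semantic_search`, and `toy_field` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(field, query_embedding):
    """Return (location, score) for the stored point that best matches the query.

    `field` is a list of (location, embedding) pairs. It stands in for a
    learned mapping from spatial locations to semantic embedding vectors.
    """
    scored = [(loc, cosine_similarity(emb, query_embedding)) for loc, emb in field]
    return max(scored, key=lambda pair: pair[1])


# Toy stand-in data: three 3D locations with made-up 4-D "semantic" vectors.
toy_field = [
    ((0.0, 0.0, 0.0), [1.0, 0.0, 0.0, 0.0]),
    ((1.0, 2.0, 0.5), [0.0, 1.0, 0.0, 0.0]),
    ((3.0, 1.0, 0.0), [0.0, 0.0, 1.0, 0.0]),
]

# In a real system this vector would come from a text encoder; here it is assumed.
query = [0.1, 0.9, 0.0, 0.0]

best_location, best_score = semantic_search(toy_field, query)
print(best_location, round(best_score, 3))
```

In a robot setting, the returned location would serve as a navigation goal for a text query. The lookup table here only illustrates the query step, not how the field is trained.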

Code Implementations: 2 repos
