RO | CV | Jul 12, 2023

GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation

arXiv:2307.05963v1 | 10 citations | h-index: 45 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the cost of human annotation and the need for domain adaptation in robotic manipulation, offering a significant, though incremental, advance in lifelong learning for robotics.

The paper tackles the performance drop in Language-Guided Robotic Manipulation (LGRM) caused by the domain gap between the data used to pre-train Visual Grounding (VG) models and real manipulation environments. It proposes GVCCI, a lifelong learning framework that generates synthetic training data without human supervision, yielding improvements of up to 56.7% in VG and up to 29.4% in LGRM.

Language-Guided Robotic Manipulation (LGRM) is a challenging task as it requires a robot to understand human instructions to manipulate everyday objects. Recent approaches in LGRM rely on pre-trained Visual Grounding (VG) models to detect objects without adapting to manipulation environments. This results in a performance drop due to a substantial domain gap between the pre-training and real-world data. A straightforward solution is to collect additional training data, but the cost of human annotation is extortionate. In this paper, we propose Grounding Vision to Ceaselessly Created Instructions (GVCCI), a lifelong learning framework for LGRM, which continuously learns VG without human supervision. GVCCI iteratively generates synthetic instructions via object detection and trains the VG model with the generated data. We validate our framework in offline and online settings across diverse environments on different VG models. Experimental results show that accumulating synthetic data from GVCCI leads to a steady improvement in VG by up to 56.7% and improves resultant LGRM by up to 29.4%. Furthermore, the qualitative analysis shows that the unadapted VG model often fails to find correct objects due to a strong bias learned from the pre-training data. Finally, we introduce a novel VG dataset for LGRM, consisting of nearly 252k triplets of image-object-instruction from diverse manipulation environments.
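
The loop described in the abstract (detect objects, generate synthetic instructions, accumulate the data, retrain the VG model) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the names `Detection`, `Triplet`, `make_instruction`, and `lifelong_loop` are hypothetical, the template-based instruction generator stands in for whatever generator the paper uses, and `detect` and `train` are caller-supplied stubs for an object detector and a VG training step.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


@dataclass
class Detection:
    """One detected object (hypothetical structure)."""
    label: str
    box: Box


@dataclass
class Triplet:
    """One image-object-instruction training example."""
    image_id: str
    box: Box
    instruction: str


def center_x(det: Detection) -> float:
    return (det.box[0] + det.box[2]) / 2


def make_instruction(target: Detection, scene: List[Detection]) -> Optional[str]:
    """Assumed template generator: name the object, and add a left/right
    qualifier when other objects in the scene share its label."""
    rivals = [d for d in scene if d is not target and d.label == target.label]
    if not rivals:
        return f"pick up the {target.label}"
    if all(center_x(target) < center_x(r) for r in rivals):
        return f"pick up the leftmost {target.label}"
    if all(center_x(target) > center_x(r) for r in rivals):
        return f"pick up the rightmost {target.label}"
    return None  # no unambiguous template applies; skip this object


def lifelong_loop(
    image_stream: Iterable[Tuple[str, object]],
    detect: Callable[[object], List[Detection]],
    train: Callable[[List[Triplet]], None],
) -> List[Triplet]:
    """Accumulate synthetic triplets and retrain on the growing buffer.
    Retraining after every image is a simplification chosen for this sketch."""
    buffer: List[Triplet] = []
    for image_id, image in image_stream:
        scene = detect(image)
        for det in scene:
            text = make_instruction(det, scene)
            if text is not None:
                buffer.append(Triplet(image_id, det.box, text))
        train(list(buffer))  # the VG model sees all data accumulated so far
    return buffer
```

In this sketch no human labels enter the loop: the only supervision comes from the detector's boxes and the generated text, and the buffer only grows, which mirrors the abstract's claim that accumulating synthetic data drives the improvement.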

Code Implementations: 1 repo
