RO, AI, CV, LG | Oct 11, 2024

VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

arXiv:2410.08792v2 | 47 citations | h-index: 9 | IROS
Originality: Incremental advance
AI Analysis

This work addresses the challenge of translating human demonstrations into actionable robot plans. The advance is incremental because it builds on existing VLM applications in robotics.

The authors generate robot task plans from human demonstration videos using a Vision Language Model (VLM) and report that their method outperforms several baselines, including state-of-the-art video-input VLMs, on their own benchmark.

Vision Language Models (VLMs) have recently been adopted in robotics for their common-sense reasoning and generalizability. Existing work has applied VLMs to generate task and motion plans from natural language instructions and to simulate training data for robot learning. In this work, we explore using a VLM to interpret human demonstration videos and generate robot task plans. Our method integrates keyframe selection, visual perception, and VLM reasoning into a pipeline. We name it SeeDo because it enables the VLM to "see" human demonstrations and explain the corresponding plans to the robot for it to "do". To validate our approach, we collected a set of long-horizon human videos demonstrating pick-and-place tasks in three diverse categories and designed a set of metrics to comprehensively benchmark SeeDo against several baselines, including state-of-the-art video-input VLMs. The experiments demonstrate SeeDo's superior performance. We further deployed the generated task plans both in a simulation environment and on a real robot arm.
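
The three stages named in the abstract (keyframe selection, visual perception, VLM reasoning) can be pictured as a simple composition. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the frame-difference heuristic in `select_keyframes`, the `Observation` type, and the `perceive` and `reason` callables are all hypothetical stand-ins, and frames are toy lists of numbers rather than images.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Toy stand-in for an image: a flat list of pixel intensities (assumption).
Frame = Sequence[float]


@dataclass
class Observation:
    """What a perception module might report for one keyframe (hypothetical)."""
    frame_index: int
    objects: List[str]


def select_keyframes(frames: Sequence[Frame], threshold: float) -> List[int]:
    """Keep frame 0, then every frame whose mean absolute change from the
    last kept frame exceeds `threshold`. This heuristic is an illustrative
    assumption; the abstract does not specify the selection rule."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        ref = frames[kept[-1]]
        diff = sum(abs(a - b) for a, b in zip(frames[i], ref)) / len(ref)
        if diff > threshold:
            kept.append(i)
    return kept


def run_pipeline(
    frames: Sequence[Frame],
    perceive: Callable[[Frame], List[str]],
    reason: Callable[[List[Observation]], List[str]],
    threshold: float = 0.5,
) -> List[str]:
    """Chain the three stages: keyframes -> perception -> plan."""
    indices = select_keyframes(frames, threshold)
    observations = [Observation(i, perceive(frames[i])) for i in indices]
    return reason(observations)


# Usage with stub stages; a real system would call a detector and a VLM here.
demo_frames = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 1.05], [0.0, 2.0]]


def stub_perceive(frame: Frame) -> List[str]:
    # Hypothetical: pretend a block is visible when the first pixel is bright.
    return ["block"] if frame[0] > 0.5 else []


def stub_reason(observations: List[Observation]) -> List[str]:
    # Hypothetical: emit one plan step per keyframe that contains an object.
    return [f"pick {name} seen at frame {o.frame_index}"
            for o in observations for name in o.objects]


plan = run_pipeline(demo_frames, stub_perceive, stub_reason)
print(plan)
```

Passing the stages as callables keeps the sketch independent of any particular detector or VLM API; each stub can be swapped for a real component without changing `run_pipeline`.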

