RO · AI · Mar 5, 2024

MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting

arXiv:2403.03174v3 · 122 citations · h-index: 12 · Robotics: Science and Systems
Originality: Incremental advance
AI Analysis

This work targets robotics researchers working on open-world robotic manipulation. It offers a method that integrates VLMs with visual prompting, though the advance is incremental because it builds on existing VLM capabilities.

The paper tackles the challenge of enabling robots to perform diverse manipulation tasks in open-world environments from free-form language instructions. It introduces MOKA, which uses vision-language models to predict affordances and generate motions, and reports competitive performance on tasks such as tool use and object rearrangement.

Open-world generalization requires robotic systems to have a profound understanding of the physical world and the user command to solve diverse and complex tasks. While the recent advancement in vision-language models (VLMs) has offered unprecedented opportunities to solve open-world problems, how to leverage their capabilities to control robots remains a grand challenge. In this paper, we introduce Marking Open-world Keypoint Affordances (MOKA), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. By prompting the pre-trained VLM, our approach utilizes the VLM's commonsense knowledge and concept understanding acquired from broad data sources to predict affordances and generate motions. To facilitate the VLM's reasoning in zero-shot and few-shot manners, we propose a visual prompting technique that annotates marks on images, converting affordance reasoning into a series of visual question-answering problems that are solvable by the VLM. We further explore methods to enhance performance with robot experiences collected by MOKA through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
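To make the mark-based prompting idea concrete, here is a minimal sketch of how labeled candidate points could turn affordance reasoning into a question with a discrete answer. It is not the authors' implementation. The names `Mark`, `grid_marks`, `build_question`, and `parse_answer`, the grid layout, the role names, and the JSON reply format are all assumptions made for illustration, and no real VLM is called.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Mark:
    """A candidate point drawn on the image (hypothetical structure)."""
    label: str              # text rendered next to the point, e.g. "a1"
    pixel: Tuple[int, int]  # (x, y) image coordinates


def grid_marks(width: int, height: int, rows: int, cols: int) -> List[Mark]:
    """Place one labeled mark at the center of each grid cell.

    Rows are lettered from 'a', columns numbered from 1, so this
    labeling assumes at most 26 rows.
    """
    marks = []
    for r in range(rows):
        for c in range(cols):
            x = int((c + 0.5) * width / cols)
            y = int((r + 0.5) * height / rows)
            marks.append(Mark(label=f"{chr(ord('a') + r)}{c + 1}", pixel=(x, y)))
    return marks


def build_question(instruction: str, marks: List[Mark], roles: List[str]) -> str:
    """Phrase keypoint selection as a multiple-choice question over labels."""
    labels = ", ".join(m.label for m in marks)
    wanted = ", ".join(roles)
    return (
        f"Task: {instruction}\n"
        f"The image shows candidate points labeled {labels}.\n"
        f"Reply with a JSON object that maps each of [{wanted}] to one label."
    )


def parse_answer(
    answer: str, marks: List[Mark], roles: List[str]
) -> Dict[str, Tuple[int, int]]:
    """Map the model's chosen labels back to pixel keypoints.

    Raises ValueError if a role is missing or a label was never drawn,
    so a malformed reply cannot silently become a robot motion.
    """
    by_label = {m.label: m.pixel for m in marks}
    chosen = json.loads(answer)
    keypoints = {}
    for role in roles:
        label = chosen.get(role)
        if label not in by_label:
            raise ValueError(f"role {role!r} has no valid mark: {label!r}")
        keypoints[role] = by_label[label]
    return keypoints
```

In use, the question from `build_question` would accompany the annotated image in a VLM request, and the reply string would go through `parse_answer`. A stubbed reply such as `'{"grasp": "a1", "target": "b2"}'` is enough to exercise the parsing step. Restricting the answer to drawn labels is the design point: the model picks among marks rather than regressing raw coordinates.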
