RO · AI · CV · Feb 25, 2024

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

arXiv:2402.16117v1 · 50 citations · h-index: 63 · ICML
Originality: Highly original
AI Analysis

The paper addresses the challenge of generalizing robotic behavior synthesis across varied scenarios in Embodied AI, offering a novel method for a known bottleneck.

The paper tackles the problem of translating multimodal inputs into precise robotic control by proposing RoboCodeX, a tree-structured multimodal code generation framework that decomposes instructions into object-centric units with physical constraints. It reports state-of-the-art performance on manipulation and navigation tasks in simulators and on real robots.

Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various scenarios. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints, and applies code generation to introduce generalization ability across various robotics platforms. To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning. Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance in both simulators and real robots on four different kinds of manipulation tasks and one navigation task.
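
As a rough illustration of the tree-structured decomposition described in the abstract, the sketch below represents an instruction as a tree whose leaves are object-centric manipulation units carrying an affordance and safety constraints, then renders each leaf as a line of code. This is not the authors' implementation: the class names (`TaskNode`, `ManipulationUnit`), the `robot.<action>(...)` call format, and the example task are all hypothetical and chosen only to make the idea concrete.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ManipulationUnit:
    """One object-centric step (hypothetical structure, not from the paper)."""
    target_object: str
    action: str
    affordance: str = ""  # e.g. which part of the object to interact with
    safety_constraints: list[str] = field(default_factory=list)


@dataclass
class TaskNode:
    """A node in the instruction tree: an inner sub-task or a leaf unit."""
    description: str
    children: list[TaskNode] = field(default_factory=list)
    unit: ManipulationUnit | None = None


def flatten(node: TaskNode) -> list[ManipulationUnit]:
    """Depth-first traversal returning leaf units in execution order."""
    if node.unit is not None:
        return [node.unit]
    units: list[ManipulationUnit] = []
    for child in node.children:
        units.extend(flatten(child))
    return units


def to_code(unit: ManipulationUnit) -> str:
    """Render a unit as a call to an assumed, platform-specific robot API."""
    constraints = ", ".join(repr(c) for c in unit.safety_constraints)
    return (
        f"robot.{unit.action}({unit.target_object!r}, "
        f"affordance={unit.affordance!r}, constraints=[{constraints}])"
    )


# Hypothetical instruction: "put the apple in the drawer".
tree = TaskNode("put the apple in the drawer", children=[
    TaskNode("open drawer", unit=ManipulationUnit(
        "drawer_handle", "pull", "handle", ["limit pull force"])),
    TaskNode("move apple", children=[
        TaskNode("pick apple", unit=ManipulationUnit(
            "apple", "pick", "top grasp", ["avoid collision with drawer"])),
        TaskNode("place apple", unit=ManipulationUnit(
            "drawer", "place", "interior surface")),
    ]),
])

program = [to_code(u) for u in flatten(tree)]
for line in program:
    print(line)
```

Emitting code strings rather than executing actions directly mirrors the abstract's point that code generation is what allows the same decomposition to target different robot platforms; in this sketch only `to_code` would need to change per platform.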

