RO · AI · Mar 23

CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation

Max Fu, Justin Yu, Karim El-Refai, Ethan Kou, Haoru Xue, Huang Huang, Wenli Xiao, Guanzhi Wang, Fei-Fei Li, Guanya Shi, Jiajun Wu, Shankar Sastry

arXiv:2603.22435 · 76.25 citations · h-index: 16

Predicted impact: top 20% in RO · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses the challenge of making coding agents more robust and autonomous for robot manipulation, though it appears incremental by building on existing Code-as-Policy concepts.

The paper evaluates and improves Code-as-Policy agents for robot manipulation. It finds that performance improves with human-crafted abstractions and degrades without them, and that scaling test-time computation narrows this gap. The resulting training-free framework, CaP-Agent0, recovers human-level reliability on several manipulation tasks in simulation and on real embodiments.

"Code-as-Policy" considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet their effectiveness as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents control robots by synthesizing and executing programs that compose perception and control primitives. Building on this foundation, CaP-Bench evaluates frontier language and vision-language models across varying levels of abstraction, interaction, and perceptual grounding. Across 12 models, CaP-Bench reveals a consistent trend: performance improves with human-crafted abstractions but degrades as these priors are removed, exposing a dependence on designer scaffolding. At the same time, we observe that this gap can be mitigated through scaling agentic test-time computation--through multi-turn interaction, structured execution feedback, visual differencing, automatic skill synthesis, and ensembled reasoning--substantially improves robustness even when agents operate over low-level primitives. These findings allow us to derive CaP-Agent0, a training-free framework that recovers human-level reliability on several manipulation tasks in simulation and on real embodiments. We further introduce CaP-RL, showing reinforcement learning with verifiable rewards improves success rates and transfers from sim2real with minimal gap. Together, CaP-X provides a principled, open-access platform for advancing embodied coding agents.

