CV · May 29, 2025

RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer

arXiv:2505.23171v1 · 8 citations · h-index: 14
Originality: Incremental advance
AI Analysis

The paper addresses the sim-to-real gap in robotic manipulation by enabling cost-effective data synthesis, though the advance appears incremental because it builds on existing diffusion methods with geometry enhancements.

The paper tackles the high cost of collecting real-world robot demonstrations by introducing RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Policies trained on its generated data achieve a 33.3% relative improvement in success rate in the DIFF-OBJ setting and a 251% relative improvement in the DIFF-ALL scenario.

Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: https://horizonrobotics.github.io/robot_lab/robotransfer
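To make the "relative improvement" figures concrete, here is a minimal sketch of the arithmetic, assuming the standard definition (new minus baseline, divided by baseline). The abstract does not state the absolute success rates, so the function names and the 0.30 and 0.40 inputs below are hypothetical and purely illustrative; they are not results from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent.

    Assumes the standard definition: (improved - baseline) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


def implied_multiplier(relative_improvement_pct: float) -> float:
    """Factor by which the baseline success rate is multiplied."""
    return 1.0 + relative_improvement_pct / 100.0


# What the reported percentages imply, independent of the absolute rates:
diff_obj_factor = implied_multiplier(33.3)   # about 1.333x the baseline
diff_all_factor = implied_multiplier(251.0)  # about 3.51x the baseline

# Hypothetical success rates (NOT from the paper), to show the formula:
example_pct = relative_improvement(0.30, 0.40)  # about 33.3 percent
```

Under this definition, a 251% relative improvement means the success rate is roughly 3.5 times the baseline, not 2.5 times. A relative figure says nothing about the absolute rates, which would have to be read from the paper itself.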

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
