CV · RO · Dec 16, 2025

DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos

arXiv:2512.14217v1 · 2 citations · h-index: 55
Originality: Incremental advance
AI Analysis

This work addresses the need for more controllable and consistent robotic demonstration videos for embodied AI, representing an incremental improvement over existing trajectory-conditioned methods.

The paper tackles the limited controllability of video diffusion models for robotic manipulation. It introduces DRAW2ACT, a depth-aware, trajectory-conditioned framework that jointly generates consistent RGB and depth videos. The authors report higher manipulation success rates than existing baselines on benchmarks such as Bridge V2 and Berkeley Autolab.

Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single-modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape, and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
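
To make the idea of cross-modality attention between spatially aligned RGB and depth streams more concrete, here is a minimal NumPy sketch. It is not the DRAW2ACT implementation: the abstract does not specify the layer design, so the function names (`cross_attention`, `fuse_rgb_depth`), the single-head formulation, the residual update, and the parameter layout are all illustrative assumptions. Each modality's tokens simply query the other modality's tokens with standard scaled dot-product attention.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from `queries` onto `context`.

    queries: (n_q, d) tokens of one modality
    context: (n_c, d) tokens of the other modality
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical parameters)
    Returns the attended values (n_q, d) and the attention weights (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def fuse_rgb_depth(rgb_tokens, depth_tokens, params):
    """Assumed bidirectional exchange: each stream attends to the other,
    and the result is added back as a residual update."""
    rgb_update, _ = cross_attention(rgb_tokens, depth_tokens, *params["rgb_from_depth"])
    depth_update, _ = cross_attention(depth_tokens, rgb_tokens, *params["depth_from_rgb"])
    return rgb_tokens + rgb_update, depth_tokens + depth_update


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_tokens = 8, 6  # toy sizes chosen only for illustration
    rgb = rng.normal(size=(n_tokens, d))
    depth = rng.normal(size=(n_tokens, d))
    params = {
        name: tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3))
        for name in ("rgb_from_depth", "depth_from_rgb")
    }
    fused_rgb, fused_depth = fuse_rgb_depth(rgb, depth, params)
    print(fused_rgb.shape, fused_depth.shape)
```

In a real video diffusion backbone the tokens would be spatio-temporal latent patches and the attention would typically be multi-head and learned end to end; the sketch only shows how information can flow in both directions between the two modalities.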
