RO · LG · Dec 19, 2025

Vidarc: Embodied Video Diffusion Model for Closed-loop Control

arXiv:2512.17661v1 · 6 citations · h-index: 14
Originality: Incremental advance
AI Analysis

The paper addresses high latency and insufficient grounding in closed-loop control for robotic manipulation, offering a domain-specific improvement.

The paper tackles robotic arm manipulation in data-scarce settings by introducing Vidarc, an autoregressive embodied video diffusion model augmented with a masked inverse dynamics model. In real-world deployment, it achieves at least a 15% higher success rate than state-of-the-art baselines, along with a 91% reduction in latency.
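To put the latency figure in perspective, the short sketch below converts a fractional latency reduction into a speedup factor. The function name and the baseline latency used in the usage lines are illustrative assumptions; the paper summary reports only the 91% figure, not absolute timings.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0 <= r < 1) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    # If latency drops by `reduction`, the remaining latency is (1 - reduction)
    # of the baseline, so the speedup is the reciprocal of that fraction.
    return 1.0 / (1.0 - reduction)


def remaining_latency(baseline_ms: float, reduction: float) -> float:
    """Latency left after applying the reduction to a baseline (same units as input)."""
    return baseline_ms * (1.0 - reduction)


# 0.91 comes from the summary above; the 1000 ms baseline is a made-up
# placeholder, not a measurement from the paper.
factor = speedup_from_reduction(0.91)
left_ms = remaining_latency(1000.0, 0.91)
print(f"speedup: {factor:.1f}x, remaining latency: {left_ms:.0f} ms")
```

In other words, a 91% reduction leaves 9% of the baseline latency, which corresponds to roughly an 11-fold speedup.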

Robotic arm manipulation in data-scarce settings is a highly challenging task due to complex embodiment dynamics and diverse contexts. Recent video-based approaches have shown great promise in capturing and transferring temporal and physical interactions by pre-training on Internet-scale video data. However, such methods are often not optimized for embodiment-specific closed-loop control, typically suffering from high latency and insufficient grounding. In this paper, we present Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control), a novel autoregressive embodied video diffusion approach augmented by a masked inverse dynamics model. By grounding video predictions with action-relevant masks and incorporating real-time feedback through cached autoregressive generation, Vidarc achieves fast, accurate closed-loop control. Pre-trained on one million cross-embodiment episodes, Vidarc surpasses state-of-the-art baselines, achieving at least a 15% higher success rate in real-world deployment and a 91% reduction in latency. We also highlight its robust generalization and error correction capabilities across previously unseen robotic platforms.
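The general idea of closed-loop control with a masked inverse dynamics step can be illustrated with a toy sketch. This is not the Vidarc implementation: every name, the two-element "frame", the mask, and the disturbance are hypothetical stand-ins chosen for illustration, and neither diffusion nor cached generation is modeled. It only shows the loop structure of observing, predicting the next frame, recovering an action from the action-relevant part of the prediction, and re-observing so that errors get corrected.

```python
GOAL = 1.0         # hypothetical target position for the robot
MASK = (1.0, 0.0)  # hypothetical action-relevant mask: keep robot entry, ignore background


def predict_next_frame(frame, step=0.25):
    """Toy stand-in for a video predictor: move the robot entry toward GOAL."""
    robot, background = frame
    delta = max(-step, min(step, GOAL - robot))
    return (robot + delta, background)


def masked_inverse_dynamics(current, predicted, mask=MASK):
    """Toy inverse dynamics: the action is the masked difference between frames."""
    return sum(m * (p - c) for m, c, p in zip(mask, current, predicted))


def run_closed_loop(steps=12, disturbance=-0.1):
    """Re-observe every step, so an external push at step 3 is corrected later."""
    robot = 0.0
    for t in range(steps):
        frame = (robot, 0.5 + 0.01 * t)  # fresh observation; background drifts
        predicted = predict_next_frame(frame)
        action = masked_inverse_dynamics(frame, predicted)
        push = disturbance if t == 3 else 0.0  # simulated perturbation
        robot += action + push
    return robot


final_position = run_closed_loop()
print(f"final robot position: {final_position:.3f}")
```

Because the mask zeroes out the background entry, the drifting background never leaks into the action, and because each step starts from a fresh observation, the push at step 3 is absorbed on the following step rather than accumulating.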

