RO · AI · CV · LG · Dec 17, 2025

mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

arXiv:2512.15692v2
53 citations
Originality: Highly original
AI Analysis

mimic-video addresses the data burden of robotic manipulation, offering researchers and practitioners a more sample-efficient training paradigm.

The paper argues that Vision-Language-Action Models (VLAs) need extensive robot data because their backbones lack physical understanding. It introduces mimic-video, a Video-Action Model that uses a pretrained video model to capture both semantics and dynamics. The authors report state-of-the-art performance, with a 10x improvement in sample efficiency and 2x faster convergence compared with traditional VLA architectures.

Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
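
To make the "flow matching-based action decoder conditioned on latent representations" more concrete, here is a minimal sketch of generic conditional flow matching in pure Python. It is not the authors' implementation. It assumes the common straight-line path between noise and the target action, and the names `interpolate`, `flow_matching_loss`, `sample_actions`, and `oracle_velocity` are hypothetical. In particular, `cond` stands in for the video model's latent representation, and `oracle_velocity` replaces a trained network with a closed-form field so the example runs without training.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to action x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between the predicted velocity and the
    straight-line target velocity (x1 - x0) at the interpolated point."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t, cond)
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def sample_actions(velocity_fn, cond, dim, steps=10, seed=0):
    """Generate an action vector by Euler-integrating the velocity field
    from Gaussian noise at t=0 toward t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the oracle never divides by zero
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def oracle_velocity(x, t, cond):
    """Hypothetical stand-in for a trained decoder: here `cond` is simply the
    target action itself, and the field points straight at it."""
    return [(c - xi) / (1.0 - t) for c, xi in zip(cond, x)]

if __name__ == "__main__":
    target_action = [0.5, -0.2, 0.1]  # illustrative values only
    noise = [1.0, 0.0, -1.0]
    print(flow_matching_loss(oracle_velocity, noise, target_action, 0.3, target_action))
    print(sample_actions(oracle_velocity, target_action, dim=3))
```

In a real system the velocity function would be a learned network, the conditioning would be a high-dimensional latent rather than the answer itself, and the action would typically be a chunk of future commands rather than a single short vector.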

