CV · RO · Mar 18

S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

arXiv:2603.16195 · 94.05 citations · h-index: 12
Predicted impact: top 10% in CV · last 90 days · Originality: Incremental advance
AI Analysis

S-VAM addresses the need for efficient, high-fidelity video foresight in robot learning, though it appears incremental because it builds on existing VAM and diffusion-model paradigms.

The paper tackles the problem that video-action models (VAMs) are either too slow or too noisy for real-time robot manipulation. It proposes S-VAM, which uses self-distillation to predict coherent geometric and semantic foresight in a single forward pass, and reports outperforming state-of-the-art methods in simulation and real-world experiments.

Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/
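
The self-distillation idea in the abstract (students regress teacher targets from noisy one-step features) can be illustrated with a toy sketch. This is not the authors' implementation: the random arrays below are hypothetical stand-ins for one-step diffusion features and for VFM representations of multi-step generated videos, the linear map stands in for a "lightweight decoupler", and the mean-squared-error objective is an assumption, since the abstract does not state the loss.

```python
import numpy as np


def train_decoupler(student_feats, teacher_targets, lr=0.5, steps=300):
    """Fit a linear student (W, b) so that student_feats @ W + b
    approximates teacher_targets under a mean-squared-error loss.

    Hypothetical stand-in for a 'lightweight decoupler'; the paper's
    actual architecture and loss are not given in the abstract.
    """
    n, d_in = student_feats.shape
    d_out = teacher_targets.shape[1]
    W = np.zeros((d_in, d_out))
    b = np.zeros(d_out)
    losses = []
    for _ in range(steps):
        err = student_feats @ W + b - teacher_targets
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the mean squared error with respect to W and b.
        W -= lr * 2.0 * student_feats.T @ err / (n * d_out)
        b -= lr * 2.0 * err.sum(axis=0) / (n * d_out)
    return W, b, losses


# Synthetic data (assumption): 256 samples of 16-dim "one-step features".
rng = np.random.default_rng(0)
one_step_feats = rng.normal(size=(256, 16))

# Synthetic "teacher targets" (assumption): in the paper these would be VFM
# representations of the diffusion model's multi-step generated videos.
hidden_map = rng.normal(size=(16, 8))
teacher_targets = one_step_feats @ hidden_map + 0.01 * rng.normal(size=(256, 8))

W, b, losses = train_decoupler(one_step_feats, teacher_targets)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.6f}")
```

Once such a student is trained, inference needs only the one-step features and a single matrix multiply, which is the efficiency argument the abstract makes for skipping multi-step denoising at test time.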

