RO | AI | Mar 29

ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation

arXiv:2603.27670 | 91.81 citations | h-index: 4
AI Analysis

This work addresses the lack of progress awareness in VLA models for long-horizon robotic manipulation tasks, offering a differentiable progress guidance method that improves task success and generalization.

ProgressVLA introduces a progress-aware vision-language-action model for robotic manipulation, achieving a low prediction residual of 0.07 in simulation and zero-shot generalization to real-world tasks, with substantial improvements in success rates on CALVIN and LIBERO benchmarks.

Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named ProgressVLA. Our technical contributions are twofold. (1) Robust progress estimation: we pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of [0, 1]) in simulation and demonstrates zero-shot generalization to unseen real-world samples. (2) Differentiable progress guidance: we introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress-piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real-world robot deployment, consistently demonstrate substantial improvements in success rates and generalization over strong baselines.
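The guidance idea in the abstract (action, then predicted future latent, then progress score, then a gradient back to the action) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the linear latent dynamics, the logistic progress head, the step size, and the names `world_model`, `progress`, and `refine_action` are all hypothetical stand-ins for the learned components.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def world_model(z, a, W):
    """Toy stand-in for the world model (assumed linear): z_next = z + W @ a."""
    return [z_i + sum(w * a_j for w, a_j in zip(row, a)) for z_i, row in zip(z, W)]


def progress(z, v):
    """Toy stand-in for the progress estimator: a scalar in (0, 1)."""
    return sigmoid(sum(v_i * z_i for v_i, z_i in zip(v, z)))


def refine_action(z, a, W, v, step=0.5, iters=10):
    """Gradient ascent on predicted progress with respect to the action.

    For these toy models the chain rule gives
    d progress / d a_j = p * (1 - p) * sum_i v_i * W[i][j].
    """
    a = list(a)
    for _ in range(iters):
        p = progress(world_model(z, a, W), v)
        grad = [
            p * (1.0 - p) * sum(v[i] * W[i][j] for i in range(len(v)))
            for j in range(len(a))
        ]
        a = [a_j + step * g for a_j, g in zip(a, grad)]
    return a


# Hypothetical 2-D latent state and 2-D action.
z0 = [0.0, 0.0]
a0 = [0.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, -1.0]

before = progress(world_model(z0, a0, W), v)
a_refined = refine_action(z0, a0, W, v)
after = progress(world_model(z0, a_refined, W), v)
```

In this toy setting the refined action yields a higher predicted progress than the initial one. In a real system the two hand-written functions would be replaced by learned networks, and the gradient would come from automatic differentiation rather than a closed-form expression.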
