CV · RO · Mar 18

Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

Chen Zhao, Zhuoran Wang, Haoyang Li, Shifeng Bao, Guanlin Li, Youhe Feng, Yang Li, Jie Tang, Jing Zhang

arXiv:2603.18091 · 87.41 citations · h-index: 6

Predicted impact top 19% in CV · last 90 days · Originality: Incremental advance

AI Analysis

The paper addresses the challenge of efficient and accurate low-level control in embodied AI tasks. It is an incremental advance that hybridizes existing methods.

It improves Vision-Language-Action models by combining the diffusion and auto-regressive paradigms to enhance robustness and generalization, raising success rate by 4.3 points in simulation and 19.7 points in real-world experiments over a diffusion-based baseline.

Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, with the only added overhead being a single VLM reranking pass.
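
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes each drafted action chunk has already been tokenized and scored by the VLM, so that per-token log-probabilities are available. The function names, the tokenization, and the exact form of the metric (standard perplexity, lowest wins) are assumptions made for illustration, not details confirmed by the abstract.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one candidate: exp of the mean negative log-probability."""
    if not token_logprobs:
        raise ValueError("candidate has no tokens to score")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_candidate(candidate_logprobs: List[Sequence[float]]) -> int:
    """Return the index of the candidate with the lowest perplexity."""
    scores = [perplexity(lp) for lp in candidate_logprobs]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-token log-probabilities for three drafted action chunks.
# In a real system these would come from one batched VLM forward pass.
candidate_logprobs = [
    [-1.2, -0.9, -1.5],
    [-0.3, -0.4, -0.2],
    [-2.0, -0.1, -0.8],
]

best = select_candidate(candidate_logprobs)
print(f"selected candidate {best}")
```

Averaging over tokens rather than summing keeps the score comparable across candidates of different token lengths, which is one reason a perplexity-style metric is a natural fit for reranking.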
