RO · AI · Mar 10

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

arXiv:2603.10126v1 · 37.52 citations · h-index: 31
Predicted impact: top 13% in RO · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the challenge of spatio-temporal consistency in robotic control for manipulation tasks, offering a modular and scalable solution, though it appears incremental as it builds on existing VLA frameworks.

The authors tackled the problem of generating temporally consistent actions in vision-language-action models by proposing an autoregressive action expert that maintains its own history, resulting in smoother action trajectories and matching or exceeding state-of-the-art task success rates.

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies.
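The control pattern described in the abstract (a fast action loop that keeps its own history while a slower perception loop occasionally refreshes the vision-language prefix) can be illustrated with a toy sketch. This is not the authors' model: `ToyARActionExpert`, `refresh_prefix`, `prefix_age`, the scalar "prefix", and the `1 / (1 + age)` staleness weighting are all hypothetical placeholders chosen for illustration, and the paper's actual re-anchoring mechanism is not reproduced here.

```python
from collections import deque


class ToyARActionExpert:
    """Toy stand-in for an autoregressive action head (not the paper's model)."""

    def __init__(self, history_len=8, gain=0.25):
        # Long-lived action memory; it is never cleared when perception updates.
        self.history = deque([0.0], maxlen=history_len)
        self.prefix = 0.0    # latest vision-language "prefix" (here just a scalar target)
        self.prefix_age = 0  # control steps elapsed since the prefix was refreshed
        self.gain = gain

    def refresh_prefix(self, prefix):
        """Swap in a new perception prefix without resetting the action history."""
        self.prefix = prefix
        self.prefix_age = 0

    def step(self):
        """Emit one action conditioned on the previous action and a possibly stale prefix."""
        last = self.history[-1]
        # Placeholder staleness handling (an assumption): trust an older prefix less.
        trust = 1.0 / (1.0 + self.prefix_age)
        action = last + self.gain * trust * (self.prefix - last)
        self.history.append(action)
        self.prefix_age += 1
        return action


def run(control_steps=12, perception_period=4):
    """Run a fast control loop with a perception refresh every `perception_period` steps."""
    expert = ToyARActionExpert()
    actions = []
    for t in range(control_steps):
        if t % perception_period == 0:  # slow perception loop
            expert.refresh_prefix(1.0 if t < 8 else 0.5)
        actions.append(expert.step())   # fast control loop
    return expert, actions


if __name__ == "__main__":
    expert, actions = run()
    print([round(a, 3) for a in actions])
```

Because the history survives each prefix refresh, the action at the step where the target changes continues from the previous action instead of restarting, which is the property the abstract contrasts with chunk-based heads that reset temporal context on every new observation.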

