FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization
This work addresses efficiency and generalization challenges in robot learning. It is an incremental improvement whose novelty lies in integrating a learnable action tokenizer with an autoregressive policy built on it.
The paper tackles the trade-off between reconstruction fidelity and inference efficiency in action tokenization for autoregressive vision-language-action (VLA) models used in robotic manipulation. It introduces FASTer, a framework that pairs a learnable tokenizer with an autoregressive policy, and reports faster inference and higher task performance than previous state-of-the-art VLA models.
Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
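To make the tokenization idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes an action chunk is a T x D grid of scalars (T time steps, D action dimensions) that can be viewed as a single-channel image, cut into non-overlapping patches, and mapped to the nearest entry of a codebook. The hand-set codebook, the patch size, the absence of any neural encoder, and the helper names `chunk_to_patches`, `quantize`, and `split_into_blocks` are all illustrative assumptions. The grouping of tokens into blocks is only one plausible reading of "block-wise autoregressive decoding".

```python
from typing import List, Sequence, Tuple

Patch = Tuple[float, ...]


def chunk_to_patches(chunk: Sequence[Sequence[float]],
                     patch_t: int, patch_d: int) -> List[Patch]:
    """View a T x D action chunk as a single-channel image and cut it
    into non-overlapping patch_t x patch_d patches (row-major order)."""
    T, D = len(chunk), len(chunk[0])
    if T % patch_t or D % patch_d:
        raise ValueError("patch size must divide the chunk shape")
    patches = []
    for t0 in range(0, T, patch_t):
        for d0 in range(0, D, patch_d):
            patches.append(tuple(chunk[t][d]
                                 for t in range(t0, t0 + patch_t)
                                 for d in range(d0, d0 + patch_d)))
    return patches


def quantize(patches: Sequence[Patch], codebook: Sequence[Patch]) -> List[int]:
    """Map each patch to the index of its nearest codebook entry
    (squared Euclidean distance). A learned tokenizer would first pass
    patches through an encoder; that step is omitted here."""
    tokens = []
    for p in patches:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def split_into_blocks(tokens: Sequence[int], block_size: int) -> List[List[int]]:
    """Group tokens into blocks; under the assumption that one decoding
    step emits one block, len(result) is the number of decoding steps."""
    return [list(tokens[i:i + block_size])
            for i in range(0, len(tokens), block_size)]


# Hypothetical toy data: 4 time steps x 2 action dimensions.
chunk = [[0.1, 0.0],
         [0.0, -0.1],
         [0.9, 1.0],
         [1.1, 1.0]]
# Hypothetical two-entry codebook over flattened 2 x 2 patches.
codebook = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)]

patches = chunk_to_patches(chunk, patch_t=2, patch_d=2)
tokens = quantize(patches, codebook)
blocks = split_into_blocks(tokens, block_size=2)

# Scalars in the chunk divided by tokens emitted.
compression_ratio = (len(chunk) * len(chunk[0])) / len(tokens)
print(tokens, blocks, compression_ratio)
```

In this toy setting, eight action scalars become two tokens, and a block size of two lets both tokens be emitted in a single decoding step. The trade-off the abstract describes shows up directly: larger patches raise the compression ratio and cut decoding steps, but each codebook entry must then stand in for more of the original signal.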