CV RO

Dec 4, 2025

FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization

Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, Liangtao Zheng, Tao Jiang

arXiv:2512.04952v2

16.413 citations, h-index: 8

Originality: Highly original

AI Analysis

The work addresses efficiency and generalization challenges in robot learning. It is an incremental improvement built on a novel integration of methods.

The paper tackles the trade-off between reconstruction fidelity and inference efficiency in autoregressive vision-language-action models for robotic manipulation. It introduces FASTer, a framework that pairs a learnable tokenizer with an autoregressive policy, and reports faster inference and higher task performance than previous state-of-the-art models.

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
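To make the idea of tokenizing an action chunk as a single-channel image more concrete, here is a minimal sketch. It is not the FASTerVQ architecture: it assumes a fixed, hand-written codebook and plain nearest-neighbor quantization of raw patches, where a learned tokenizer would train an encoder and codebook. The names `chunk_to_patches`, `quantize`, and `decode_passes`, the patch sizes, and the toy values are all hypothetical and chosen only for illustration.

```python
from math import ceil


def chunk_to_patches(chunk, patch_t, patch_d):
    """View a T x D action chunk as a 1-channel image and cut it into flat patches.

    chunk is a list of T timesteps, each a list of D action dimensions.
    """
    T, D = len(chunk), len(chunk[0])
    assert T % patch_t == 0 and D % patch_d == 0, "patch size must tile the chunk"
    patches = []
    for t0 in range(0, T, patch_t):
        for d0 in range(0, D, patch_d):
            patches.append([
                chunk[t][d]
                for t in range(t0, t0 + patch_t)
                for d in range(d0, d0 + patch_d)
            ])
    return patches


def quantize(patches, codebook):
    """Map each patch to the index of its nearest codebook entry (squared L2)."""
    tokens = []
    for p in patches:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def decode_passes(n_tokens, block_size):
    """Forward passes needed when block_size tokens are emitted per pass (assumed scheme)."""
    return ceil(n_tokens / block_size)


# Toy chunk: 4 timesteps x 2 action dimensions (hypothetical values).
chunk = [[0.1, 0.0], [0.0, 0.1], [0.9, 1.0], [1.0, 0.8]]
# Toy codebook with two entries; a real tokenizer would learn this.
codebook = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]

patches = chunk_to_patches(chunk, patch_t=2, patch_d=2)
tokens = quantize(patches, codebook)
n_values = len(chunk) * len(chunk[0])
compression_ratio = n_values / len(tokens)

print(tokens)                  # [0, 1]
print(compression_ratio)       # 4.0
print(decode_passes(8, 4))     # 2
```

In this toy setting, eight continuous action values collapse to two discrete tokens, and emitting tokens in blocks of four would cut an eight-token sequence from eight decoding passes to two. That is the general kind of saving the abstract attributes to a high compression ratio combined with block-wise decoding, under the assumptions stated above.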

