RO · AI · Dec 23, 2025

Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation

arXiv:2512.20188v1 · 2 citations · h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses real-time control and stability issues in whole-body robotic manipulation and offers a practical solution for commercial deployment. The advance is incremental, since it builds on existing VLA architectures.

The paper tackles the performance bottleneck in Vision-Language-Action (VLA) systems caused by running a slow vision-language model and a fast action expert synchronously. It introduces an asynchronous framework (DuoCore-FS) that generates actions at 30 Hz, approximately three times as fast as prior models of comparable size, and improves task success rates and responsiveness in whole-body robotic manipulation.

Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals, yet both typically run at a single unified frequency. As a result, policy performance is constrained by the low inference speed of large VLMs. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS), organizing the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The system is characterized by two key features. First, a latent representation buffer bridges the slow and fast systems. It stores instruction semantics and action-reasoning representations aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still jointly trained end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while achieving 30 Hz whole-body action-chunk generation, approximately three times as fast as prior VLA models with comparable model sizes. Real-world whole-body manipulation experiments demonstrate improved task success rates and significantly enhanced responsiveness compared to synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, including training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
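
The fast/slow split bridged by a latent buffer can be illustrated with a minimal Python sketch. This is not the DuoCore-FS implementation: the names `Latent`, `LatentBuffer`, `slow_pathway`, `fast_pathway`, and `run_demo` are hypothetical, the "latent" is a placeholder tuple rather than a VLM representation, and the 10 Hz slow rate is an assumed value (the abstract states only the 30 Hz action rate). The point is the pattern: the slow loop overwrites a single-slot buffer whenever it finishes, and the fast loop reads whatever is newest without waiting.

```python
import threading
import time
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Latent:
    """Hypothetical stand-in for the slow pathway's output representation."""
    version: int                  # number of slow-pathway updates published so far
    features: Tuple[float, ...]   # placeholder for the real latent tensor


class LatentBuffer:
    """Single-slot buffer: the slow pathway overwrites, the fast pathway reads the latest."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Optional[Latent] = None

    def write(self, latent: Latent) -> None:
        with self._lock:
            self._latest = latent

    def read(self) -> Optional[Latent]:
        with self._lock:
            return self._latest


def slow_pathway(buffer: LatentBuffer, stop: threading.Event, period_s: float) -> None:
    """Publishes a new latent every period_s seconds (sleep stands in for VLM inference)."""
    version = 0
    while not stop.is_set():
        time.sleep(period_s)
        version += 1
        buffer.write(Latent(version, (float(version),)))


def fast_pathway(buffer: LatentBuffer, stop: threading.Event,
                 period_s: float, log: List[int]) -> None:
    """Runs at a higher rate and never blocks on the slow pathway."""
    while not stop.is_set():
        latent = buffer.read()
        if latent is not None:
            # Placeholder for action-chunk generation conditioned on the latent.
            log.append(latent.version)
        time.sleep(period_s)


def run_demo(duration_s: float = 0.5, fast_hz: float = 30.0,
             slow_hz: float = 10.0) -> List[int]:
    """Runs both loops concurrently and returns the latent version used at each fast step."""
    buffer = LatentBuffer()
    stop = threading.Event()
    log: List[int] = []
    threads = [
        threading.Thread(target=slow_pathway, args=(buffer, stop, 1.0 / slow_hz)),
        threading.Thread(target=fast_pathway, args=(buffer, stop, 1.0 / fast_hz, log)),
    ]
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(run_demo())
```

In a typical run the returned list contains repeated version numbers, because the fast loop reuses the same latent for several steps until the slow loop publishes a newer one. In a synchronous design, by contrast, every action step would wait for a fresh VLM pass.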

