RO · AI · Apr 23

From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

arXiv:2604.21391 · 96.7 · h-index: 3
AI Analysis

For embodied AI researchers, this work addresses the spatiotemporal scale mismatch between cognition and action in vision-language-action (VLA) policies. It replaces generation from noise with refinement from a predicted intent, targeting the representation inefficiency and weak condition alignment of existing generative policies.

ResVLA shifts generative VLA policies from noise-to-action generation to intent-to-residual refinement. It uses spectral decomposition to separate global intent from local dynamics, and reports competitive performance, robustness to language and embodiment perturbations, and faster convergence in simulation, along with strong results on real-world robot tasks.

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.
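The split into a low-frequency anchor and a high-frequency residual can be illustrated with a small NumPy sketch. The abstract does not say which spectral transform or cutoff ResVLA uses, so the real-FFT low-pass filter, the `keep` cutoff, and the function name `split_anchor_residual` are all assumptions made for illustration. The sketch only decomposes a given action chunk. In the paper the anchor is predicted by the model and the residual is produced by a diffusion bridge, neither of which is shown here.

```python
import numpy as np


def split_anchor_residual(actions: np.ndarray, keep: int):
    """Split an action chunk of shape (T, D) into anchor + residual.

    Hypothetical helper: `keep` is the number of lowest real-FFT bins
    retained for the anchor. The paper's actual transform and cutoff
    are not given in the abstract.
    """
    T = actions.shape[0]
    spectrum = np.fft.rfft(actions, axis=0)      # shape (T // 2 + 1, D)
    low = spectrum.copy()
    low[keep:] = 0.0                             # zero out the high-frequency bins
    anchor = np.fft.irfft(low, n=T, axis=0)      # smooth "global intent" component
    residual = actions - anchor                  # remaining "local dynamics"
    return anchor, residual


# Toy 2-DoF chunk: a slow periodic motion plus small jitter.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 32, endpoint=False)
smooth = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
chunk = smooth + 0.05 * rng.standard_normal(smooth.shape)

anchor, residual = split_anchor_residual(chunk, keep=3)
print(np.linalg.norm(anchor), np.linalg.norm(residual))
```

In this toy setup the residual carries far less energy than the anchor. That is the intuition behind letting a generative model refine only the residual instead of producing the whole action from noise.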

